Weight Banding

This article is part of the Circuits thread, an experimental format collecting invited short articles and critical commentary delving into the inner workings of neural networks.

Branch Specialization

Open up any ImageNet conv net and look at the weights in the last layer. You’ll find a uniform spatial pattern to them, dramatically unlike anything we see elsewhere in the network. No individual weight is unusual, but the uniformity is so striking that when we first discovered it we thought it must be a bug. Just as different biological tissue types jump out as distinct under a microscope, the weights in this final layer jump out as distinct when visualized with NMF. We call this phenomenon weight banding.

1. When visualized with NMF, the weight banding in layer mixed_5b is as visually striking compared to any other layer in InceptionV1 (here shown: mixed_3a) as the smooth, regular striation of muscle tissue is when compared to any other tissue (here shown: cardiac muscle tissue and epithelial tissue).

So far, the Circuits thread has mostly focused on studying very small pieces of neural network – individual neurons and small circuits. In contrast, weight banding is an example of what we call a “structural phenomenon,” a larger-scale pattern in the circuits and features of a neural network. Other examples of structural phenomena are the recurring symmetries we see in equivariance motifs and the specialized slices of neural networks we see in branch specialization. In the case of weight banding, we think of it as a structural phenomenon because the pattern appears at the scale of an entire layer.

In addition to describing weight banding, we’ll explore when and why it occurs. We find that there appears to be a causal link between whether a model uses global average pooling or fully connected layers at the end, suggesting that weight banding is part of an algorithm for preserving information about larger scale structure in images. Establishing causal links like this is a step towards closing the loop between practical decisions in training neural networks and the phenomena we observe inside them.

Where weight banding occurs

Weight banding consistently forms in the final convolutional layer of vision models with global average pooling.

In order to see the bands, we need to visualize the spatial structure of the weights, as shown below. We typically do this using NMF, as described in Visualizing Weights. For each neuron, we take the weights connecting it to the previous layer. We then use NMF to reduce the number of dimensions corresponding to channels in the previous layer down to 3 factors, which we can map to RGB channels. Since which factor is which is arbitrary, we use a heuristic to make the mapping consistent across neurons. This reveals a very prominent pattern of horizontalThe stripes aren’t always perfectly horizontal - sometimes they exhibit a slight preference for extra weight in the center of the central band, as seen in some examples below. stripes.

2. These common networks have pooling operations before their fully connected layers and consistently show banding at their last convolutional layers.

InceptionV1
mixed 5b

ResNet50
block 4 unit 3

VGG19
conv5

Interestingly, AlexNet does not exhibit this phenomenon.

3. AlexNet does not have a pooling operation before its fully connected layers and does not show banding at its last convolutional layer.

To make it easier to look for groups of similar weights, we sorted the neurons at each layer by similarity of their reduced forms.

AlexNet
conv5

Unlike most modern vision models, AlexNet does not use global average pooling. Instead, it has a fully connected layer directly connected to its final convolutional layer, allowing it to treat different positions differently. If one looks at the weights of this fully connected layer, the weights strongly vary as a function of the global y position.

The horizontal stripes in weight banding mean that the filters don’t care about horizontal position, but are strongly encoding relative vertical position. Our hypothesis is that weight banding is a learned way to preserve spatial information as it gets lost through various pooling operations.

In the next section, we will construct our own simplified vision network and investigate variations on its architecture in order to understand exactly which conditions are necessary to produce weight banding.

What affects banding

We’d like to understand which architectural decisions affect weight banding. This will involve trying out different architectures and seeing whether weight banding persists. Since we will only want to change a single architectural parameter at a time, we will need a consistent baseline to apply our modifications to. Ideally, this baseline would be as simple as possible.

We created a simplified network architecture with 6 groups of convolutions, separated by L2 pooling layers. At the end, it has a global average pooling operation that reduces the input to 512 values that are then fed to a fully connected layer with 1001 outputs.

4. Our simplified vision network architecture.

This simplified network reliably produces weight banding in its last layer (and usually in the two preceding layers as well).

5. NMF of the weights in the last layer of the simplified model shows clear weight banding.

simplified model (5b), baseline

In the rest of this section, we’ll experiment with modifying this architecture and its training settings and seeing if weight banding is preserved.

Rotating images 90 degrees

To rule out bugs in training or some strange numerical problem, we decided to do a training run with the input rotated by 90 degrees. This sanity check yielded a very clear result showing vertical banding in the resulting weights, instead of horizontal banding. This is a clear indication that banding is a result of properties within the ImageNet dataset which make spatial vertical position(or, in the case of the rotated dataset, spatial horizontal position) relevant.

6. simplified model (5b), 90º rotation

Fully connected layer without global average pooling

We remove the global average pooling step in our simplified model, allowing the fully connected layer to see all spatial positions at once. This model did not exhibit weight banding, but used 49x more parameters in the fully connected layer and overfit to the training set. This is pretty strong evidence that the use of aggressive pooling after the last convolutions in common models causes weight banding. This result is also consistent with AlexNet not showing this banding phenomenon (since it also does not have global average pooling).

7. simplified model (5b), no pooling before fully connected layer

Average pooling along x-axis only

We average out each row of the final convolutional layer, so that vertical absolute position is preserved but horizontal absolute position is not.Since this model has 7x7 spatial positions in the final convolutional layer, this modification increases the number of parameters in the fully connected layer by 7x, but not the 49x of a complete fully connected layer with no pooling at all. The banding at the last layer seems to go away, but on closer investigation, clear banding is still visible in layer 5a, similar to the baseline model’s 5b. We found this result surprising.

8. NMF of weights in 5a and 5b in a version of the simplified model modified to have pooling only along the x-axis. Banding is gone from 5b but reappears in 5a!

simplified model (5a), x-axis pooling

simplified model (5b), x-axis pooling

Approaches where weight banding persisted

We tried each of the modifications below, and found that weight banding was still present in each of these variants.

Global average pooling with learned spatial masks. By applying several different spatial masks and global average pooling, we can allow the model to preserve some spatial information. Intuitively, each mask can select for a different subset of spatial positions. We tried experimental runs using each of 3, 5, or 16 different masks. The masks that were learned corresponded to large-scale global structure, but banding was still strongly present.
Using an attention layer instead of pooling/fully connected combination after layer 5b.
Adding a 7x7x512 mask with learned weights after 5b. The hope was that a mask would help each 5b neuron focus on the right parts of the 7x7 image without a convolution.
Adding CoordConv channels to the inputs of 5a and 5b.
Splitting the output of 5b into 16 7x7x32 channel groups and feeding each group its own fully connected layer. The output of the 16 fully connected layers is then concatenated into the input of the final 1001-class fully connected layer.
Using a global max pool, 4096-unit fully connected layer, then 1001-unit fully connected layer (inspired by VGG).

An interactive diagram allowing you to explore the weights for these experiments and more can be found in the appendix.

Confirming banding interventions in common architectures

In the previous section, we observed two interventions that clearly affected weight banding: rotating the dataset by 90º and removing the global average pooling before the fully connected layer. To confirm that these effects hold beyond our simplified model, we decided to make the same interventions to three common architectures (InceptionV1, ResNet50, VGG19) and train them from scratch.

With one exception, the effect holds in all three models.

InceptionV1

9. Inception V1, layer mixed_5c, 5x5 convolution

baseline

90º rotation

global average pooling layer removed

ResNet50

10. ResNet50, last 3x3 convolutional layer

baseline

90º rotation

global average pooling layer removed

VGG19

11. VGG19, last 3x3 convolutional layer.

baseline

90º rotation

global average pooling layer removed

The one exception is VGG19, where the removal of the pooling operation before its set of fully connected layers did not eliminate weight banding as expected; these weights look fairly similar to the baseline. However, it clearly responds to rotation.

Conclusion

Once we really understand neural networks, one would expect us to be able to leverage that understanding to design more effective neural networks architectures. Early papers, like Zeiler et al, emphasized this quite strongly, but it’s unclear whether there have yet been any significant successes in doing this. This hints at significant limitations in our work. It may also be a missed opportunity: it seems likely that if interpretability was useful in advancing neural network capabilities, it would become more integrated into other research and get attention from a wider range of researchers.

It’s unclear whether weight banding is “good” or “bad.”On one hand, the 90º rotation experiment shows that weight banding is a product of the dataset and is encoding useful information into the weights. However, if spatial information could flow through the network in a different, more efficient way, then perhaps the channels would be able to focus on encoding relationships between features without needing to track spatial positions. We don’t have any recommendation or action to take away from it. However, it is an example of a consistent link between architecture decisions and the resulting trained weights. It has the right sort of flavor for something that could inform architectural design, even if it isn’t particularly actionable itself.

More generally, weight banding is an example of a large-scale structure. One of the major limitations of circuits has been how small-scale it is. We’re hopeful that larger scale structures like weight banding may help circuits form a higher-level story of neural networks.

This article is part of the Circuits thread, an experimental format collecting invited short articles and critical commentary delving into the inner workings of neural networks.

Branch Specialization

Technical Notes

Training the simplified network

The simplified network used to study this phenomenon was trained on Imagenet (1.2 million images) for 90 epochs. Training was done on 8 GPUs with a global batch size of 512 for the first 30 epochs and 1024 for the remaining 60 epochs. The network was built using TF-Slim. Batch norm was used on convolutional layers and fully connected layers, except for the last fully connected layer with 1001 outputs.

12. Types of banding across different experiments.

To explore how layer weights are affected by the various attempts to affect banding, we clustered a normalized form of the weights in the experiments discussed above. In this figure, you can explore how the proportion and type of banding changes with the various experiments.

Highlighted labels indicate experiments where weight banding no longer persisted for the given intervention and layer.

Follow up experiment ideas

The following experiments were discussed in various conversations but have not been run at this time:

Using x-pooling and y-pooling together before the fully connected layer to present a lossy form of spatial positions to the fully connected layer. (Alec Radford’s suggestion)
Rotating the input randomly acts as a regularization technique to induce no banding? (it would likely work but hurt performance)

Author Contributions

As with many scientific collaborations, the contributions are difficult to separate because it was a collaborative effort that we wrote together.

Research. Ludwig Schubert accidentally discovered weight banding, thinking it was a bug. Michael Petrov performed an array of systematic investigations into when it occurs and how architectural decisions affect it. This investigation was done in the context of and informed by collaborative research into circuits by Nick Cammarata, Gabe Goh, Chelsea Voss, Chris Olah, and Ludwig.

Writing and Diagrams. Michael wrote and illustrated a first version of this article. Chelsea improved the text and illustrations, and thought about big picture framing. Chris helped with editing.

Acknowledgments

We are grateful to participants of #circuits in the Distill Slack for their engagement on this article, and especially to Alex Bäuerle, Ben Egan, Patrick Mineault, Vincent Tjeng, and David Valdman for their remarks on a first draft.

References

Muscle Tissue: Cardiac Muscle [link]
Library, B.C.C.B.I., 2018.
Epithelial Tissues: Stratified Squamous Epithelium [link]
Library, B.C.C.B.I., 2018.
Deconvolution and Checkerboard Artifacts [link]
Odena, A., Dumoulin, V. and Olah, C., 2016. Distill. DOI: 10.23915/distill.00003
Going Deeper with Convolutions
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V. and Rabinovich, A., 2015. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1--9.
Deep Residual Learning for Image Recognition
He, K., Zhang, X., Ren, S. and Sun, J., 2016. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770--778.
Very Deep Convolutional Networks for Large-Scale Image Recognition
Simonyan, K. and Zisserman, A., 2014. arXiv preprint arXiv:1409.1556.
ImageNet Classification with Deep Convolutional Neural Networks [PDF]
Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2012. Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, pp. 1097--1105.
An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution [PDF]
Liu, R., Lehman, J., Molino, P., Such, F.P., Frank, E., Sergeev, A. and Yosinski, J., 2018. CoRR, Vol abs/1807.03247.
Visualizing and Understanding Convolutional Networks
Zeiler, M.D. and Fergus, R., 2014. European conference on computer vision, pp. 818--833.

Updates and Corrections

If you see mistakes or want to suggest changes, please create an issue on GitHub.

Reuse

Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise. The figures that have been reused from other sources don’t fall under this license and can be recognized by a note in their caption: “Figure from …”.

Citation

For attribution in academic contexts, please cite this work as

Petrov, et al., "Weight Banding", Distill, 2021.

BibTeX citation

@article{petrov2021weight,
  author = {Petrov, Michael and Voss, Chelsea and Schubert, Ludwig and Cammarata, Nick and Goh, Gabriel and Olah, Chris},
  title = {Weight Banding},
  journal = {Distill},
  year = {2021},
  note = {https://distill.pub/2020/circuits/weight-banding},
  doi = {10.23915/distill.00024.009}
}

[wikitissue2] Muscle Tissue: Cardiac Muscle [link]
Library, B.C.C.B.I., 2018.

[wikitissue1] Epithelial Tissues: Stratified Squamous Epithelium [link]
Library, B.C.C.B.I., 2018.

[odena2016deconvolution] Deconvolution and Checkerboard Artifacts [link]
Odena, A., Dumoulin, V. and Olah, C., 2016. Distill. DOI: 10.23915/distill.00003

[szegedy2015going] Going Deeper with Convolutions
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V. and Rabinovich, A., 2015. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1--9.

[he2016deep] Deep Residual Learning for Image Recognition
He, K., Zhang, X., Ren, S. and Sun, J., 2016. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770--778.

[simonyan2014very] Very Deep Convolutional Networks for Large-Scale Image Recognition
Simonyan, K. and Zisserman, A., 2014. arXiv preprint arXiv:1409.1556.

[krizhevsky2012] ImageNet Classification with Deep Convolutional Neural Networks [PDF]
Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2012. Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, pp. 1097--1105.

[COORDCONV] An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution [PDF]
Liu, R., Lehman, J., Molino, P., Such, F.P., Frank, E., Sergeev, A. and Yosinski, J., 2018. CoRR, Vol abs/1807.03247.

[zeiler2014visualizing] Visualizing and Understanding Convolutional Networks
Zeiler, M.D. and Fergus, R., 2014. European conference on computer vision, pp. 818--833.

Weight Banding

Authors

Affiliations

Published

DOI

Introduction

Where weight banding occurs

What affects banding

Rotating images 90 degrees

Fully connected layer without global average pooling

Average pooling along x-axis only

Approaches where weight banding persisted

Confirming banding interventions in common architectures

InceptionV1

ResNet50

VGG19

Conclusion

Technical Notes

Training the simplified network

Follow up experiment ideas

Author Contributions

Acknowledgments

References

Updates and Corrections

Reuse

Citation