If we think of interpretability as a kind of “anatomy of neural networks,” most of the circuits thread has involved studying tiny little veins – looking at the small-scale, at individual neurons and how they connect. However, there are many natural questions that the small-scale approach doesn’t address.
In contrast, the most prominent abstractions in biological anatomy involve larger-scale structures: individual organs like the heart, or entire organ systems like the respiratory system. And so we wonder: is there a “respiratory system” or “heart” or “brain region” of an artificial neural network? Do neural networks have any emergent structures that we could study that are larger-scale than circuits?
This article describes branch specialization, one of three larger “structural phenomena” we’ve been able to observe in neural networks. (The other two, equivariance and weight banding, have separate dedicated articles.) Branch specialization occurs when neural network layers are split up into branches. The neurons and circuits tend to self-organize, clumping related functions into each branch and forming larger functional units – a kind of “neural network brain region.” We find evidence that these structures implicitly exist in neural networks without branches, and that branches simply reify structures that already exist.
The earliest example of branch specialization that we’re aware of comes from AlexNet, which splits its first few layers into two branches: the original authors observed that one branch learns black-and-white Gabor filters while the other learns low-frequency color detectors.
Although the first layer of AlexNet is the only example of branch specialization we’re aware of being discussed in the literature, it seems to be a common phenomenon. We find that branch specialization happens in later hidden layers, not just the first layer. It occurs in both low-level and high-level features. It occurs in a wide range of models, including places you might not expect it – for example, residual blocks in ResNets can function as branches and specialize. Finally, branch specialization appears to exist implicitly even in plain convolutional nets, without any explicit branching structure to cause it.
Is there a large-scale structure to how neural networks operate? How are features and circuits organized within the model? Does network architecture influence the features and circuits that form? Branch specialization hints at an exciting story related to all of these questions.
Many neural network architectures have branches – sequences of layers which temporarily don’t have access to information in “parallel” branches, even though that information is still passed on to later layers.
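As a concrete illustration, here is a minimal sketch (with made-up sizes, loosely in the style of AlexNet’s two-branch convolutional layers) of what a branched layer looks like: each branch processes only its own slice of the activations until the branches are merged again.

    import torch
    import torch.nn as nn

    class TwoBranchBlock(nn.Module):
        def __init__(self):
            super().__init__()
            # Two parallel stacks of conv layers; neither sees the other's activations.
            self.branch_a = nn.Sequential(
                nn.Conv2d(24, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
            self.branch_b = nn.Sequential(
                nn.Conv2d(24, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())

        def forward(self, x):                     # x: [N, 48, H, W]
            a, b = x.chunk(2, dim=1)              # split the channels between branches
            out = torch.cat([self.branch_a(a), self.branch_b(b)], dim=1)
            return out                            # later layers see both branches again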
In the past, models with explicitly-labeled branches were popular (such as AlexNet and the Inception family of networks). Explicit branching has since fallen out of fashion, but residual networks contain a subtler, implicit form of branching.
The implicit branching of residual networks has some important nuances. At first glance, each residual block forms a two-way branch: the block itself and the identity connection that skips around it. But because the branches are combined together by addition, we can actually rewrite the model to reveal that the residual blocks themselves can be understood as branches in parallel:
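To make the rewrite concrete, here is a minimal sketch using stand-in linear blocks of our own (not anything from the article). Unrolling a stack of residual blocks expresses the output as the input plus a sum of per-block contributions, with each block reading the sum of everything upstream of it – structurally, parallel branches writing into a shared stream.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    f1, f2 = nn.Linear(8, 8), nn.Linear(8, 8)  # stand-ins for two residual block bodies
    x = torch.randn(3, 8)

    # Sequential view: each block adds its output onto the running activations.
    h1 = x + f1(x)
    y_sequential = h1 + f2(h1)

    # Parallel-branch view: the identity plus each block's contribution,
    # where every block reads the sum of everything upstream of it.
    y_parallel = x + f1(x) + f2(x + f1(x))

    assert torch.allclose(y_sequential, y_parallel)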
We typically see residual blocks specialize in very deep residual networks (e.g. ResNet-152). One hypothesis for why is that, in these models, the exact depth of a layer doesn’t matter much, so the branching aspect becomes more important than the sequential aspect.
One of the conceptual weaknesses of normal branching models is that although branches can save parameters, mixing values between branches still requires a lot of parameters. However, if you buy the branch interpretation of residual networks, you can see them as a strategy to sidestep this: residual networks intermix branches (block-sparse weights) with low-rank connections (projecting all the blocks down into a shared sum and then back up). This seems like a really elegant way to handle branching. More practically, it suggests that analysis of residual networks might be well-served by paying close attention to the units inside the blocks, and that we might expect the residual stream to be unusually polysemantic.
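As a rough illustration of that tradeoff, here is some back-of-envelope arithmetic with made-up layer sizes (real convolutional layers would add kernel-size factors on top of these counts):

    # Splitting a layer of width 512 into 4 branches cuts its weight count,
    # but mixing the branches back together densely gives the savings back.
    width, n_branches = 512, 4
    branch_width = width // n_branches

    dense_layer = width * width                      # 262,144 weights
    branched_layer = n_branches * branch_width ** 2  # 65,536 weights: a 4x savings
    dense_mixing = width * width                     # 262,144 weights to re-mix

    # Summing branches into a shared residual stream mixes them for free:
    # addition has no parameters at all.
    additive_mixing = 0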
Branch specialization is defined by features organizing themselves between branches. In a normal layer, features are organized randomly: a given feature is equally likely to be any neuron in the layer. But in a branched layer, we often see features of a given type cluster in one branch. The branch has specialized on that type of feature.
How does this happen? Our intuition is that there’s a positive feedback loop during training: if one branch starts out slightly better at supporting a given type of feature, downstream weights come to rely on that branch for it, which in turn strengthens the gradient pushing related features to form in the same branch.
Another way to think about this is that if you need to cut a neural network into pieces that have limited ability to communicate with each other, it makes sense to organize similar features close together, because they probably need to share more information.
So far, the only concrete example we’ve shown of branch specialization is the first and second layers of AlexNet. What about later layers? AlexNet also splits its later layers into branches, after all. This seems to be unexplored, presumably because studying features beyond the first layer is much harder.
Unfortunately, branch specialization in the later layers of AlexNet is also very subtle. Instead of one overall split, there seem to be dozens of small clusters of neurons, with each cluster assigned to a branch. It’s hard to be confident that one isn’t just seeing patterns in noise.
But other models have very clear branch specialization in later layers. This tends to happen when a branch constitutes only a very small fraction of a layer, either because there are many branches or because one is much smaller than others. In these cases, the branch can specialize on a very small subset of the features that exist in a layer and reveal a clear pattern.
For example, most of InceptionV1’s layers have a branched structure. The branches have varying numbers of units and varying convolution sizes. The 5x5 branch is the smallest branch, and also has the largest convolution size. It’s often very specialized:
This is exceptionally unlikely to have occurred by chance.
For example, all 9 of the black-and-white vs. color detectors in mixed3a are in mixed3a_5x5, despite it being only 32 out of the 256 neurons in the layer. The probability of that happening by chance is less than 1/10^8. For a more extreme example, all 30 of the curve-related features in mixed3b are in mixed3b_5x5, despite it being only 96 out of the 480 neurons in the layer. The probability of that happening by chance is less than 1/10^20.
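For readers who want to check those odds, here is a quick sketch of the calculation – our own illustration, not from the article – modeling the assignment of features to neurons as uniform draws without replacement:

    from math import comb

    # P(all n features of a given type land in one branch) under a
    # hypergeometric model: the n feature positions are chosen uniformly
    # among the layer's neurons.
    def prob_all_in_branch(branch_size, layer_size, n_features):
        return comb(branch_size, n_features) / comb(layer_size, n_features)

    print(prob_all_in_branch(32, 256, 9))    # ~2.5e-09, below 1/10^8
    print(prob_all_in_branch(96, 480, 30))   # ~1.5e-23, below 1/10^20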
It’s worth noting one confounding factor which might be influencing the specialization. The 5x5 branches are the smallest branches, but they also have larger convolutions (5x5 instead of 3x3 or 1x1) than their neighbors. Is it the small branch size or the larger convolution size that causes black-and-white vs. color detectors to cluster in the mixed3a_5x5 branch, or curve detectors in the mixed3b_5x5 branch?
Perhaps the most surprising thing about branch specialization is that the same branch specializations seem to occur again and again, across different architectures and tasks.
For example, the branch specialization we observed in AlexNet – the first layer specializing into a black-and-white Gabor branch vs. a low-frequency color branch – is a surprisingly robust phenomenon. It occurs consistently if you retrain AlexNet. It also occurs if you train other architectures with the first few layers split into two branches. It even occurs if you train those models on other natural image datasets, like Places instead of ImageNet. Anecdotally, we also seem to see other types of branch specialization recur: for example, branches that specialize in curve detection appear to be quite common (although InceptionV1’s mixed3b_5x5 is the only one we’ve carefully characterized).
So, why do the same branch specializations occur again and again?
One hypothesis seems very tempting. Notice that many of the same features that form in normal, non-branched models also seem to form in branched models. For example, the first layers of both branched and non-branched models contain Gabor filters and color features. If the same features exist, presumably the same weights exist between them.
Could it be that branching just surfaces a structure that already exists? Perhaps there are two subgraphs in the weights between the first and second conv layers of a normal model, with relatively small weights between them, and when you train a branched model, these two subgraphs latch onto the branches.
(This would be directionally similar to work finding modular substructures in the weights of ordinary, non-branched neural networks.)
To test this, let’s look at models which have non-branched first and second convolutional layers. Let’s take the weights between them and perform a singular value decomposition (SVD) on the absolute values of the weights. This will show us the main factors of variation in which first-layer neurons each second-layer neuron connects to (irrespective of whether those connections are excitatory or inhibitory).
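Here is a minimal sketch of that analysis in NumPy, with random stand-in weights (a real version would load the trained conv2 kernel instead):

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(size=(64, 24, 3, 3))   # [out_units, in_units, kh, kw]

    # Ignore sign and collapse the spatial dimensions, so entry [j, i]
    # measures how strongly second-layer unit j connects to first-layer unit i.
    strength = np.abs(w).sum(axis=(2, 3))

    # The right-singular vectors are the main factors of variation in which
    # first-layer units a second-layer unit reads from.
    u, s, vt = np.linalg.svd(strength, full_matrices=False)
    top_factor = vt[0]     # for InceptionV1, the article finds this factor is color
    second_factor = vt[1]  # and the next factor is spatial frequency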
Sure enough, the top singular vector (the largest factor of variation) of the weights between the first two convolutional layers of InceptionV1 is color.
We also see that the second factor appears to be frequency. This suggests an interesting prediction: perhaps if we were to split the layer into more than two branches, we’d also observe specialization in frequency in addition to color.
This prediction seems to hold. For example, in one model whose first layer is split into four branches, we see a high-frequency black-and-white branch, a mid-frequency mostly black-and-white branch, a mid-frequency color branch, and a low-frequency color branch.
We’ve shown that branch specialization is one example of a structural phenomenon – a larger-scale structure in a neural network. It happens in a variety of situations and neural network architectures, and it happens with remarkable consistency: certain motifs of specialization, such as color, frequency, and curves, recur across different architectures and tasks.
Returning to our comparison with anatomy, although we hesitate to claim explicit parallels to neuroscience, it’s tempting to draw analogies between branch specialization and the existence of regions of the brain focused on particular tasks.
The visual cortex, the auditory cortex, Broca’s area and Wernicke’s area are all examples of brain regions that appear to specialize in particular functions.
As with many scientific collaborations, the contributions are difficult to separate: this was a collaborative effort that we wrote together.
Research. The phenomenon of branch specialization was initially observed by Chris Olah. Chris also developed the weight PCA experiments suggesting that it implicitly occurs in non-branched models. This investigation was done in the context of and informed by collaborative research into circuits by Nick Cammarata, Gabe Goh, Chelsea Voss, Ludwig Schubert, and Chris. Chelsea and Nick contributed to framing this work in terms of the importance of larger-scale structures on top of circuits.
Infrastructure. Branch specialization was only discovered because an early version of Microscope by Ludwig Schubert made it easy to browse the neurons that exist at certain layers. Michael Petrov, Ludwig and Nick built a variety of infrastructural tools which made our research possible.
Writing and Diagrams. Chelsea wrote the article, based on an initial draft by Chris and with Chris’s help. Diagrams were illustrated by both Chelsea and Chris.
We are grateful to Brice Ménard for pushing us to investigate whether we can find larger-scale structures such as the one investigated here. We are grateful to participants of #circuits in the Distill Slack for their engagement on this article, and especially to Alex Bäuerle, Ben Egan, Patrick Mineault, Matt Nolan, and Vincent Tjeng for their remarks on a first draft. We’re grateful to Patrick Mineault for noting the neuroscience comparison to subspecialization within primate V2.
If you see mistakes or want to suggest changes, please create an issue on GitHub.
Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise. The figures that have been reused from other sources don’t fall under this license and can be recognized by a note in their caption: “Figure from …”.
For attribution in academic contexts, please cite this work as
Voss, et al., "Branch Specialization", Distill, 2021.
BibTeX citation
@article{voss2021branch,
  author = {Voss, Chelsea and Goh, Gabriel and Cammarata, Nick and Petrov, Michael and Schubert, Ludwig and Olah, Chris},
  title = {Branch Specialization},
  journal = {Distill},
  year = {2021},
  note = {https://distill.pub/2020/circuits/branch-specialization},
  doi = {10.23915/distill.00024.008}
}
Comment
As neuroscientists we’re excited by this work as it offers fresh theoretical perspectives on long-standing questions about how brains are organised and how they develop. Branching and specialisation are found throughout the brain. A well-studied example is the dorsal and ventral visual streams, which are associated with spatial and non-spatial visual processing. At the microcircuit level, neurons in each pathway are similar. However, recordings of neural activity demonstrate remarkable specialisation; classic experiments from the 1970s and 80s established the idea that the ventral stream enables identification of objects whereas the dorsal stream represents their location. Since then, much has been learned about signal processing in these pathways, but fundamental questions, such as why there are multiple streams and how they are established, remain unanswered.
From the perspective of a neuroscientist, a striking result from the investigation of branch specialization by Voss and her colleagues is that robust branch specialisation emerges in the absence of any complex branch-specific design rules. Their analyses show that specialisation is similar within and across architectures, and across different training tasks. The implication here is that no specific instructions are required for branch specialisation to emerge. Indeed, their analyses suggest that it even emerges in the absence of predetermined branches. By contrast, the intuition of many neuroscientists would be that specialisation of different areas of the neocortex requires developmental mechanisms that are specific to each area. For neuroscientists aiming to understand how perceptual and cognitive functions of the brain arise, an important idea here is that developmental mechanisms that drive the separation of cortical pathways, such as the dorsal and ventral visual streams, may be absolutely critical.
While the parallels between branch specialization in artificial neural networks and neural circuits in the brain are striking, there are clearly major differences and many outstanding questions. From the perspective of building artificial neural networks, we wonder whether branch-specific tuning of individual units and their connectivity rules would enhance performance. In the brain, there is good evidence that the activation functions of individual neurons are fine-tuned between and even within distinct neural circuits. If this fine-tuning confers benefits on the brain then we might expect similar benefits in artificial networks. From the perspective of understanding the brain, we wonder whether branch specialisation could help make experimentally testable predictions. If artificial networks can be engineered with branches that have organisation similar to branching pathways in the brain, then manipulations to these networks could be compared to experimental manipulations achieved with optogenetic and chemogenetic strategies. Given that many brain disorders involve changes to specific neural populations, similar strategies could give insights into how these pathological changes alter brain functions. For example, very specific populations of neurons are disrupted in early stages of Alzheimer’s disease. By disrupting corresponding units in neural network models, one could explore the resulting computational deficits and possible strategies for restoration of cognitive functions.