If we think of interpretability as a kind of “anatomy of neural networks,” most of the circuits thread has involved studying tiny little veins – looking at the small-scale, at individual neurons and how they connect. However, there are many natural questions that the small-scale approach doesn’t address.
In contrast, the most prominent abstractions in biological anatomy involve larger-scale structures: individual organs like the heart, or entire organ systems like the respiratory system. And so we wonder: is there a “respiratory system” or “heart” or “brain region” of an artificial neural network? Do neural networks have any emergent structures that we could study that are larger-scale than circuits?
This article describes branch specialization, one of three larger “structural phenomena” we’ve been able to observe in neural networks. (The other two, equivariance and weight banding, have separate dedicated articles.) Branch specialization occurs when neural network layers are split up into branches. The neurons and circuits tend to self-organize, clumping related functions into each branch and forming larger functional units – a kind of “neural network brain region.” We find evidence that these structures implicitly exist in neural networks without branches, and that branches simply reify structures that already exist.
The earliest example of branch specialization that we’re aware of comes from AlexNet, which splits its first few layers into two branches: the original authors observed that one branch learns black-and-white Gabor filters while the other learns low-frequency color detectors.
Although the first layer of AlexNet is the only example of branch specialization we’re aware of being discussed in the literature, it seems to be a common phenomenon. We find that branch specialization happens in later hidden layers, not just the first layer. It occurs in both low-level and high-level features. It occurs in a wide range of models, including places you might not expect it – for example, residual blocks in ResNets can function as branches and specialize. Finally, branch specialization appears to exist implicitly even in plain convolutional nets, without any explicit branching structure to cause it.
Is there a large-scale structure to how neural networks operate? How are features and circuits organized within the model? Does network architecture influence the features and circuits that form? Branch specialization hints at an exciting story related to all of these questions.
Many neural network architectures have branches – sequences of layers which temporarily don’t have access to information in “parallel” branches, even though that information is still passed on to later layers.
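As a concrete illustration, here is a minimal sketch (with made-up sizes, loosely in the style of AlexNet’s two-branch convolutional layers) of what a branched layer looks like: each branch processes only its own slice of the activations until the branches are merged again.

    import torch
    import torch.nn as nn

    class TwoBranchBlock(nn.Module):
        def __init__(self):
            super().__init__()
            # Two parallel stacks of conv layers; neither sees the other's activations.
            self.branch_a = nn.Sequential(
                nn.Conv2d(24, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
            self.branch_b = nn.Sequential(
                nn.Conv2d(24, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())

        def forward(self, x):                     # x: [N, 48, H, W]
            a, b = x.chunk(2, dim=1)              # split the channels between branches
            out = torch.cat([self.branch_a(a), self.branch_b(b)], dim=1)
            return out                            # later layers see both branches again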
In the past, models with explicitly-labeled branches were popular (such as AlexNet and the Inception family of networks). Explicit branching has since fallen out of fashion, but residual networks contain a subtler, implicit form of branching.
The implicit branching of residual networks has some important nuances. At first glance, each residual block forms a two-way branch: the block itself and the identity connection that skips around it. But because the branches are combined together by addition, we can actually rewrite the model to reveal that the residual blocks themselves can be understood as branches in parallel:
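To make the rewrite concrete, here is a minimal sketch using stand-in linear blocks of our own (not anything from the article). Unrolling a stack of residual blocks expresses the output as the input plus a sum of per-block contributions, with each block reading the sum of everything upstream of it – structurally, parallel branches writing into a shared stream.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    f1, f2 = nn.Linear(8, 8), nn.Linear(8, 8)  # stand-ins for two residual block bodies
    x = torch.randn(3, 8)

    # Sequential view: each block adds its output onto the running activations.
    h1 = x + f1(x)
    y_sequential = h1 + f2(h1)

    # Parallel-branch view: the identity plus each block's contribution,
    # where every block reads the sum of everything upstream of it.
    y_parallel = x + f1(x) + f2(x + f1(x))

    assert torch.allclose(y_sequential, y_parallel)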
We typically see residual blocks specialize in very deep residual networks (e.g. ResNet-152). One hypothesis for why is that, in these models, the exact depth of a layer doesn’t matter much, so the branching aspect becomes more important than the sequential aspect.
One of the conceptual weaknesses of normal branching models is that although branches can save parameters, mixing values between branches still requires a lot of parameters. However, if you buy the branch interpretation of residual networks, you can see them as a strategy to sidestep this: residual networks intermix branches (block-sparse weights) with low-rank connections (projecting all the blocks down into a shared sum and then back up). This seems like a really elegant way to handle branching. More practically, it suggests that analysis of residual networks might be well-served by paying close attention to the units inside the blocks, and that we might expect the residual stream to be unusually polysemantic.
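As a rough illustration of that tradeoff, here is some back-of-envelope arithmetic with made-up layer sizes (real convolutional layers would add kernel-size factors on top of these counts):

    # Splitting a layer of width 512 into 4 branches cuts its weight count,
    # but mixing the branches back together densely gives the savings back.
    width, n_branches = 512, 4
    branch_width = width // n_branches

    dense_layer = width * width                      # 262,144 weights
    branched_layer = n_branches * branch_width ** 2  # 65,536 weights: a 4x savings
    dense_mixing = width * width                     # 262,144 weights to re-mix

    # Summing branches into a shared residual stream mixes them for free:
    # addition has no parameters at all.
    additive_mixing = 0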
Branch specialization is defined by features organizing themselves between branches. In a normal layer, features are organized randomly: a given feature is equally likely to be any neuron in the layer. But in a branched layer, we often see features of a given type cluster in one branch. The branch has specialized on that type of feature.
How does this happen? Our intuition is that there’s a positive feedback loop during training: if one branch starts out slightly better at supporting a given type of feature, downstream weights come to rely on that branch for it, which in turn strengthens the gradient pushing related features to form in the same branch.
Another way to think about this is that if you need to cut a neural network into pieces that have limited ability to communicate with each other, it makes sense to organize similar features close together, because they probably need to share more information.
So far, the only concrete example we’ve shown of branch specialization is the first and second layers of AlexNet. What about later layers? AlexNet also splits its later layers into branches, after all. This seems to be unexplored, presumably because studying features beyond the first layer is much harder.
Unfortunately, branch specialization in the later layers of AlexNet is also very subtle. Instead of one overall split, there seem to be dozens of small clusters of neurons, with each cluster assigned to a branch. It’s hard to be confident that one isn’t just seeing patterns in noise.
But other models have very clear branch specialization in later layers. This tends to happen when a branch constitutes only a very small fraction of a layer, either because there are many branches or because one is much smaller than others. In these cases, the branch can specialize on a very small subset of the features that exist in a layer and reveal a clear pattern.
For example, most of InceptionV1’s layers have a branched structure. The branches have varying numbers of units and varying convolution sizes. The 5x5 branch is the smallest branch, and also has the largest convolution size. It’s often very specialized:
This is exceptionally unlikely to have occurred by chance.
For example, all 9 of the black-and-white vs. color detectors in mixed3a are in mixed3a_5x5, despite it being only 32 out of the 256 neurons in the layer. The probability of that happening by chance is less than 1/10^8. For a more extreme example, all 30 of the curve-related features in mixed3b are in mixed3b_5x5, despite it being only 96 out of the 480 neurons in the layer. The probability of that happening by chance is less than 1/10^20.
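For readers who want to check those odds, here is a quick sketch of the calculation – our own illustration, not from the article – modeling the assignment of features to neurons as uniform draws without replacement:

    from math import comb

    # P(all n features of a given type land in one branch) under a
    # hypergeometric model: the n feature positions are chosen uniformly
    # among the layer's neurons.
    def prob_all_in_branch(branch_size, layer_size, n_features):
        return comb(branch_size, n_features) / comb(layer_size, n_features)

    print(prob_all_in_branch(32, 256, 9))    # ~2.5e-09, below 1/10^8
    print(prob_all_in_branch(96, 480, 30))   # ~1.5e-23, below 1/10^20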
It’s worth noting one confounding factor which might be influencing the specialization. The 5x5 branches are the smallest branches, but they also have larger convolutions (5x5 instead of 3x3 or 1x1) than their neighbors. Is it the small branch size or the larger convolution size that causes black-and-white vs. color detectors to cluster in the mixed3a_5x5 branch, or curve detectors in the mixed3b_5x5 branch?
Perhaps the most surprising thing about branch specialization is that the same branch specializations seem to occur again and again, across different architectures and tasks.
For example, the branch specialization we observed in AlexNet – the first layer specializing into a black-and-white Gabor branch vs. a low-frequency color branch – is a surprisingly robust phenomenon. It occurs consistently if you retrain AlexNet. It also occurs if you train other architectures with the first few layers split into two branches. It even occurs if you train those models on other natural image datasets, like Places instead of ImageNet. Anecdotally, we also seem to see other types of branch specialization recur: for example, branches that specialize in curve detection appear to be quite common (although InceptionV1’s mixed3b_5x5 is the only one we’ve carefully characterized).
So, why do the same branch specializations occur again and again?
One hypothesis seems very tempting. Notice that many of the same features that form in normal, non-branched models also seem to form in branched models. For example, the first layers of both branched and non-branched models contain Gabor filters and color features. If the same features exist, presumably the same weights exist between them.
Could it be that branching just surfaces a structure that already exists? Perhaps there are two subgraphs in the weights between the first and second conv layers of a normal model, with relatively small weights between them, and when you train a branched model, these two subgraphs latch onto the branches.
(This would be directionally similar to work finding modular substructures in the weights of ordinary, non-branched neural networks.)
To test this, let’s look at models which have non-branched first and second convolutional layers. Let’s take the weights between them and perform a singular value decomposition (SVD) on the absolute values of the weights. This will show us the main factors of variation in which first-layer neurons each second-layer neuron connects to (irrespective of whether those connections are excitatory or inhibitory).
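Here is a minimal sketch of that analysis in NumPy, with random stand-in weights (a real version would load the trained conv2 kernel instead):

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(size=(64, 24, 3, 3))   # [out_units, in_units, kh, kw]

    # Ignore sign and collapse the spatial dimensions, so entry [j, i]
    # measures how strongly second-layer unit j connects to first-layer unit i.
    strength = np.abs(w).sum(axis=(2, 3))

    # The right-singular vectors are the main factors of variation in which
    # first-layer units a second-layer unit reads from.
    u, s, vt = np.linalg.svd(strength, full_matrices=False)
    top_factor = vt[0]     # for InceptionV1, the article finds this factor is color
    second_factor = vt[1]  # and the next factor is spatial frequency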
Sure enough, the top singular vector (the largest factor of variation) of the weights between the first two convolutional layers of InceptionV1 is color.
We also see that the second factor appears to be frequency. This suggests an interesting prediction: perhaps if we were to split the layer into more than two branches, we’d also observe specialization in frequency in addition to color.
This prediction seems to hold. For example, in one model whose first layer is split into four branches, we see a high-frequency black-and-white branch, a mid-frequency mostly black-and-white branch, a mid-frequency color branch, and a low-frequency color branch.
We’ve shown that branch specialization is one example of a structural phenomenon – a larger-scale structure in a neural network. It happens in a variety of situations and neural network architectures, and it happens with remarkable consistency: certain motifs of specialization, such as color, frequency, and curves, recur across different architectures and tasks.
Returning to our comparison with anatomy, although we hesitate to claim explicit parallels to neuroscience, it’s tempting to draw analogies between branch specialization and the existence of regions of the brain focused on particular tasks.
The visual cortex, the auditory cortex, Broca’s area and Wernicke’s area are all examples of brain regions that appear to specialize in particular functions.
As with many scientific collaborations, the contributions are difficult to separate: this was a collaborative effort that we wrote together.
Research. The phenomenon of branch specialization was initially observed by Chris Olah. Chris also developed the weight PCA experiments suggesting that it implicitly occurs in non-branched models. This investigation was done in the context of and informed by collaborative research into circuits by Nick Cammarata, Gabe Goh, Chelsea Voss, Ludwig Schubert, and Chris. Chelsea and Nick contributed to framing this work in terms of the importance of larger-scale structures on top of circuits.
Infrastructure. Branch specialization was only discovered because an early version of Microscope by Ludwig Schubert made it easy to browse the neurons that exist at certain layers. Michael Petrov, Ludwig and Nick built a variety of infrastructural tools which made our research possible.
Writing and Diagrams. Chelsea wrote the article, based on an initial draft by Chris and with Chris’s help. Diagrams were illustrated by both Chelsea and Chris.
We are grateful to Brice Ménard for pushing us to investigate whether we can find larger-scale structures such as the one investigated here. We are grateful to participants of #circuits in the Distill Slack for their engagement on this article, and especially to Alex Bäuerle, Ben Egan, Patrick Mineault, Matt Nolan, and Vincent Tjeng for their remarks on a first draft. We’re grateful to Patrick Mineault for noting the neuroscience comparison to subspecialization within primate V2.
If you see mistakes or want to suggest changes, please create an issue on GitHub.
Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise. The figures that have been reused from other sources don’t fall under this license and can be recognized by a note in their caption: “Figure from …”.
For attribution in academic contexts, please cite this work as
Voss, et al., "Branch Specialization", Distill, 2021.
BibTeX citation
@article{voss2021branch,
  author = {Voss, Chelsea and Goh, Gabriel and Cammarata, Nick and Petrov, Michael and Schubert, Ludwig and Olah, Chris},
  title = {Branch Specialization},
  journal = {Distill},
  year = {2021},
  note = {https://distill.pub/2020/circuits/branch-specialization},
  doi = {10.23915/distill.00024.008}
}
Comment
As neuroscientists we’re excited by this work as it offers fresh theoretical perspectives on long-standing questions about how brains are organised and how they develop. Branching and specialisation are found throughout the brain. A well-studied example is the dorsal and ventral visual streams, which are associated with spatial and non-spatial visual processing. At the microcircuit level, neurons in each pathway are similar. However, recordings of neural activity demonstrate remarkable specialisation; classic experiments from the 1970s and 80s established the idea that the ventral stream enables identification of objects whereas the dorsal stream represents their location. Since then, much has been learned about signal processing in these pathways, but fundamental questions, such as why there are multiple streams and how they are established, remain unanswered.
From the perspective of a neuroscientist, a striking result from the investigation of branch specialization by Voss and her colleagues is that robust branch specialisation emerges in the absence of any complex branch-specific design rules. Their analyses show that specialisation is similar within and across architectures, and across different training tasks. The implication here is that no specific instructions are required for branch specialisation to emerge. Indeed, their analyses suggest that it even emerges in the absence of predetermined branches. By contrast, the intuition of many neuroscientists would be that specialisation of different areas of the neocortex requires developmental mechanisms that are specific to each area. For neuroscientists aiming to understand how perceptual and cognitive functions of the brain arise, an important idea here is that developmental mechanisms that drive the separation of cortical pathways, such as the dorsal and ventral visual streams, may be absolutely critical.
While the parallels between branch specialization in artificial neural networks and neural circuits in the brain are striking, there are clearly major differences and many outstanding questions. From the perspective of building artificial neural networks, we wonder whether branch-specific tuning of individual units and their connectivity rules would enhance performance. In the brain, there is good evidence that the activation functions of individual neurons are fine-tuned between and even within distinct neural circuits. If this fine-tuning confers benefits on the brain then we might expect similar benefits in artificial networks. From the perspective of understanding the brain, we wonder whether branch specialisation could help make experimentally testable predictions. If artificial networks can be engineered with branches that have organisation similar to branching pathways in the brain, then manipulations to these networks could be compared to experimental manipulations achieved with optogenetic and chemogenetic strategies. Given that many brain disorders involve changes to specific neural populations, similar strategies could give insights into how these pathological changes alter brain functions. For example, very specific populations of neurons are disrupted in early stages of Alzheimer’s disease. By disrupting corresponding units in neural network models, one could explore the resulting computational deficits and possible strategies for restoration of cognitive functions.