With diverse environments, we can analyze, diagnose and edit deep reinforcement learning models using attribution.
Figure: Attribution from a hidden layer to the value function, showing which features of the observation (left, video game still) are used to predict success (positive attribution, "good news", middle) and failure (negative attribution, "bad news", right). Applying dimensionality reduction (NMF) yields features that detect various in-game objects, such as coins, enemies and buzzsaws.
In this article, we apply interpretability techniques to a reinforcement learning (RL) model trained to play the video game CoinRun.
Our results depend on levels in CoinRun being procedurally-generated, leading us to formulate a diversity hypothesis for interpretability. If it is correct, then we can expect RL models to become more interpretable as the environments they are trained on become more diverse. We provide evidence for our hypothesis by measuring the relationship between interpretability and generalization.
Finally, we provide a thorough investigation of several interpretability techniques in the context of RL vision, and pose a number of questions for further research.
CoinRun is a side-scrolling platformer in which the agent must dodge enemies and other traps and collect the coin at the end of the level.
CoinRun is procedurally generated, meaning that each new level encountered by the agent is randomly generated from scratch. This incentivizes the model to learn how to spot the different kinds of objects in the game, since it cannot get away with simply memorizing a small number of specific trajectories.
Here are some examples of the objects used, along with walls and floors, to generate CoinRun levels.
Figure: Examples of these objects (coins, enemies, buzzsaws and other obstacles), shown both at full resolution and at the 64x64 model resolution at which the agent sees them.
There are 9 actions available to the agent in CoinRun: ←, →, ↓, ↑, ↖, ↗, A, B and C.
We trained a convolutional neural network on CoinRun for around 2 billion timesteps, using PPO.
Since the only available reward is a fixed bonus for collecting the coin, the value function estimates the time-discounted probability of collecting the coin, i.e., of successfully completing the level.
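To make this concrete, here is a sketch of the relationship, treating the coin bonus $R$ and per-timestep discount factor $\gamma$ as assumed quantities (their values are not given in this article):

$$
V(s) \;=\; \mathbb{E}\big[\gamma^{T}\,R\cdot\mathbf{1}[\text{coin collected}]\big],
$$

where $T$ is the number of timesteps until the coin is collected. Up to the constant $R$, this is the time-discounted probability of completing the level, which is why the value function can be read as a percentage (e.g. the 95% and 98% figures quoted later).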
Having trained a strong RL agent, we were curious to see what it had learned. Following prior interpretability work, we built an interface that combines attribution with dimensionality reduction (NMF) to highlight the objects the model detects and how they influence its value function and policy.
Here is our interface for a typical trajectory, with the value function as the network output. It reveals the model using obstacles, coins, enemies and more to compute the value function.
Figure: the interface — observation (video game pixels seen by the model), positive attribution (good news: color overlay shows objects predictive of success), and negative attribution (bad news: color overlay shows objects predictive of failure).
Our fully-trained model fails to complete around 1 in every 200 levels. We explored a few of these failures using our interface, and found that we were usually able to understand why they occurred.
The failure often boils down to the fact that the model has no memory, and must therefore choose its action based only on the current observation. It is also common for some unlucky sampling of actions from the agent’s policy to be partly responsible.
Here are some cherry-picked examples of failures, carefully analyzed step-by-step.
Figure: a step-by-step analysis of one failure. The agent moves too far to the right while in mid-air because a buzzsaw obstacle is temporarily hidden from view by a moving enemy; the buzzsaw comes back into view, but too late to avoid a collision. The interface shows the observation, positive attribution and negative attribution, together with attribution for the sampled action and the policy's probabilities over the 9 actions (← → ↓ ↑ ↗ ↖ A B C).
We searched for errors in the model using generalized advantage estimation (GAE).
Using our interface, we found a couple of cases in which the model “hallucinated” a feature not present in the observation, causing the value function to spike.
Figure: a value function "hallucination". At one point the value function spiked upwards from 95% to 98% for a single timestep, because a curved yellow-brown shape in the background, which happened to appear next to a wall, was mistaken for a coin. The interface shows the observation, positive attribution and negative attribution.
Our analysis so far has been mostly qualitative. To quantitatively validate our analysis, we hand-edited the model to make the agent blind to certain features identified by our interface: buzzsaw obstacles in one case, and left-moving enemies in another. Our method for this can be thought of as a primitive form of circuit-editing.
We evaluated each edit by measuring the percentage of levels that the new agent failed to complete, broken down by the object that the agent collided with to cause the failure. Our results show that our edits had large, targeted effects, with comparatively small effects on the agent's other abilities.
Percentage of levels failed, broken down by cause of failure (each model tested on 10,000 levels):

| Model | Buzzsaw obstacle | Enemy moving left | Enemy moving right | Multiple or other |
|---|---|---|---|---|
| Original model | 0.37% | 0.16% | 0.12% | 0.08% |
| Buzzsaw obstacle blindness | 12.76% | 0.16% | 0.08% | 0.05% |
| Enemy moving left blindness | 0.36% | 4.69% | 0.97% | 0.07% |
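As a rough sanity check on differences of this size, a binomial confidence interval over 10,000 sampled levels can be computed as follows. This is a minimal sketch using the normal approximation; the helper below is ours, not part of the article's codebase.

```python
import math

def failure_rate_ci(p_percent, n=10_000, z=1.96):
    """Approximate 95% confidence interval for a failure rate (in percent)
    estimated from n independently sampled levels, using the normal
    approximation to the binomial distribution."""
    p = p_percent / 100
    half_width = z * math.sqrt(p * (1 - p) / n)
    return ((p - half_width) * 100, (p + half_width) * 100)

print(failure_rate_ci(0.37))   # original model, buzzsaw failures: about (0.25%, 0.49%)
print(failure_rate_ci(12.76))  # buzzsaw-blinded model: about (12.1%, 13.4%)
```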
We did not manage to achieve complete blindness, however: the buzzsaw-edited model still performed significantly better than the original model did when we made the buzzsaws completely invisible.
Percentage of levels failed when the buzzsaws are made completely invisible (10,000 levels tested):

| Model | Buzzsaw obstacle | Enemy moving left | Enemy moving right | Multiple or other |
|---|---|---|---|---|
| Original model, invisible buzzsaws | 32.20% | 0.05% | 0.05% | 0.05% |
We experimented briefly with iterating the editing procedure, but were not able to achieve more than around 50% buzzsaw blindness by this metric without affecting the model’s other abilities.
Here are the original and edited models playing some cherry-picked levels.
All of the above analysis uses the same hidden layer of our network, the third of five convolutional layers, since it was much harder to find interpretable features at other layers. Interestingly, the level of abstraction at which this layer operates – finding the locations of various in-game objects – is exactly the level at which CoinRun levels are randomized using procedural generation. Furthermore, we found that training on many randomized levels was essential for us to be able to find any interpretable features at all.
This led us to suspect that the diversity introduced by CoinRun’s randomization is linked to the formation of interpretable features. We call this the diversity hypothesis:
Interpretable features tend to arise (at a given level of abstraction) if and only if the training distribution is diverse enough (at that level of abstraction).
Our explanation for this hypothesis is as follows. For the forward implication (“only if”), we only expect features to be interpretable if they are general enough, and when the training distribution is not diverse enough, models have no incentive to develop features that generalize instead of overfitting. For the reverse implication (“if”), we do not expect it to hold in a strict sense: diversity on its own is not enough to guarantee the development of interpretable features, since they must also be relevant to the task. Rather, our intention with the reverse implication is to hypothesize that it holds very often in practice, as a result of generalization being bottlenecked by diversity.
In CoinRun, procedural generation is used to incentivize the model to learn skills that generalize to unseen levels.
To test our hypothesis, we made the training distribution less diverse, by training the agent on a fixed set of 100 levels. This dramatically reduced our ability to interpret the model's features. Here we display an interface for the new model, generated in the same way as the one above. The smoothly increasing value function suggests that the model has memorized the number of timesteps until the end of the level, and the features it uses for this focus on irrelevant background objects. Similar overfitting occurs for other video games with a limited number of levels.
Figure: the interface for the new model — observation (video game pixels seen by the model), positive attribution (good news: color overlay shows objects predictive of success), and negative attribution (bad news: color overlay shows objects predictive of failure).
We attempted to quantify this effect by varying the number of levels used to train the agent, and evaluating the 8 features identified by our interface on how interpretable they were.
| Number of training levels | 100 | 300 | 1,000 | 3,000 | 10,000 | 30,000 | 100,000 |
|---|---|---|---|---|---|---|---|
| Levels completed (train, run 1) | 99.96% | 99.82% | 99.67% | 99.65% | 99.47% | 99.55% | 99.57% |
| Levels completed (train, run 2) | 99.97% | 99.86% | 99.70% | 99.46% | 99.39% | 99.50% | 99.37% |
| Levels completed (test, run 1) | 61.81% | 66.95% | 74.93% | 89.87% | 97.53% | 98.66% | 99.25% |
| Levels completed (test, run 2) | 64.13% | 67.64% | 73.46% | 90.36% | 97.44% | 98.89% | 99.35% |
| Features interpretable (researcher 1, run 1) | 52.5% | 22.5% | 11.25% | 45% | 90% | 75% | 91.25% |
| Features interpretable (researcher 2, run 1) | 8.75% | 8.75% | 10% | 26.25% | 56.25% | 90% | 70% |
| Features interpretable (researcher 1, run 2) | 15% | 13.75% | 15% | 23.75% | 53.75% | 90% | 96.25% |
| Features interpretable (researcher 2, run 2) | 3.75% | 6.25% | 21.25% | 45% | 72.5% | 83.75% | 77.5% |

Percentages of levels completed are estimated by sampling 10,000 levels with replacement.
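For readers who want to reproduce this kind of sweep, the number of distinct levels can be controlled directly in the open-source Procgen Benchmark version of CoinRun. The article used the original CoinRun environment, so treat the API below as an illustrative assumption rather than the exact setup used here.

```python
import gym

# num_levels restricts training to a fixed set of procedurally generated
# levels; num_levels=0 means an unlimited supply of levels (full diversity).
env = gym.make("procgen:procgen-coinrun-v0", num_levels=100, start_level=0)

obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())
```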
Our results illustrate how diversity may lead to interpretable features via generalization, lending support to the diversity hypothesis. Nevertheless, we still consider the hypothesis to be far from proven.
Feature visualization
Figure: gradient-based feature visualizations of first-layer and intermediate-layer channels, for an ImageNet classifier and for our CoinRun model.
Gradient-based feature visualization has previously been shown to struggle with RL models trained on Atari games, and we found the same to be true for our CoinRun model, even after trying a number of modifications to the technique.
As shown below, we were able to use dataset examples to identify a number of channels that pick out human-interpretable features. It is therefore striking how resistant gradient-based methods were to our efforts. We believe that this is because solving CoinRun does not ultimately require much visual ability. Even with our modifications, it is possible to solve the game using simple visual shortcuts, such as picking out certain small configurations of pixels. These shortcuts work well on the narrow distribution of images on which the model is trained, but behave unpredictably in the full space of images in which gradient-based optimization takes place.
Our analysis here provides further insight into the diversity hypothesis. In support of the hypothesis, we have examples of features that are hard to interpret in the absence of diversity. But there is also evidence that the hypothesis may need to be refined. Firstly, it seems to be a lack of diversity at a low level of abstraction that harms our ability to interpret features at all levels of abstraction, which could be due to the fact that gradient-based feature visualization needs to back-propagate through earlier layers. Secondly, the failure of our efforts to increase low-level visual diversity suggests that diversity may need to be assessed in the context of the requirements of the task.
As an alternative to gradient-based feature visualization, we use dataset examples. This idea has a long history, and can be thought of as a heavily-regularized form of feature visualization.
Unlike gradient-based feature visualization, this method finds some meaning to the different directions in activation space. However, it may still fail to provide a complete picture for each direction, since it only shows a limited number of dataset examples, and with limited context.
CoinRun observations differ from natural images in that they are much less spatially invariant. For example, the agent always appears in the center, and the agent’s velocity is always encoded in the top left. As a result, some features detect unrelated things at different spatial positions, such as reading the agent’s velocity in the top left while detecting an unrelated object elsewhere. To account for this, we developed a spatially-aware version of dataset example-based feature visualization, in which we fix each spatial position in turn, and choose the observation with the strongest activation at that position (with a limited number of reuses of the same observation, for diversity). This creates a spatial correspondence between visualizations and observations.
Here is such a visualization for a feature that responds strongly to coins. The white squares in the top left show that the feature also responds strongly to the horizontal velocity info when it is white, corresponding to the agent moving right at full speed.
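Here is a minimal sketch of how such a spatially-aware visualization could be assembled. The helper below is illustrative and not the article's lucid.scratch.rl_util implementation; it assumes the feature's activations have already been collected into a single array.

```python
import numpy as np

def spatial_dataset_examples(acts, max_reuse=3):
    """For each spatial position of a feature map, pick the observation whose
    activation is strongest at that position, reusing any one observation at
    most max_reuse times for diversity.

    acts: (N, H, W) activations of one feature across N observations.
    Returns an (H, W) array of observation indices; the visualization then
    tiles crops of the chosen observations at the corresponding positions.
    """
    N, H, W = acts.shape
    chosen = np.zeros((H, W), dtype=int)
    uses = np.zeros(N, dtype=int)
    for i in range(H):
        for j in range(W):
            # Rank observations by activation strength at this position.
            for n in np.argsort(-acts[:, i, j]):
                if uses[n] < max_reuse:
                    chosen[i, j] = n
                    uses[n] += 1
                    break
    return chosen
```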
Attribution
We showed above that a dimensionality reduction method known as non-negative matrix factorization (NMF) could be applied to the channels of activations to produce meaningful directions in activation space.
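For concreteness, here is a minimal sketch of how such directions could be computed with scikit-learn; the article's own tooling lives in lucid.scratch.rl_util, so this helper is illustrative rather than the exact implementation.

```python
import numpy as np
from sklearn.decomposition import NMF

def nmf_directions(acts, n_features=8):
    """Factorize non-negative (post-ReLU) activations into a small number of
    directions in channel space.

    acts: (N, H, W, C) activations from the chosen layer.
    Returns an (n_features, C) array; each row is a direction in activation
    space whose spatial strength can be visualized as a feature.
    """
    N, H, W, C = acts.shape
    flat = acts.reshape(-1, C)          # treat every spatial position as a sample
    model = NMF(n_components=n_features, init="nndsvda", max_iter=500)
    model.fit(flat)
    return model.components_
```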
Following prior attribution work, we measure attribution from this layer to the value function using integrated gradients (explained in detail below), and use the same NMF directions to break the attribution down into the features identified above, displayed as a color overlay on the observation.
Figure: attribution for a single observation, shown as a color overlay — observation (left), positive attribution (good news, middle), negative attribution (bad news, right).
For the full version of our interface, we simply repeat this for an entire trajectory of the agent playing the game. We also incorporate video controls and a timeline view of compressed observations.
Attributions for our CoinRun model have some interesting properties that would be unusual for an ImageNet model, which suggests that some care may be required when interpreting them.
We are motivated to study interpretability for RL for two reasons.
We think that large neural networks are currently the most likely type of model to be used in highly capable and influential AI systems in the future. Contrary to the traditional perception of neural networks as black boxes, we think that there is a fighting chance that we will be able to clearly and thoroughly understand the behavior even of very large networks. We are therefore most excited by neural network interpretability research that scores highly on criteria such as exhaustiveness.
Our proposed questions reflect this perspective. One of the reasons we emphasize diversity relates to exhaustiveness. If “non-diverse features” remain when diversity is present, then our current techniques are not exhaustive and could end up missing important features of more capable models. Developing tools to understand non-diverse features may shed light on whether this is likely to be a problem.
We think there may be significant mileage in simply applying existing interpretability techniques, with attention to detail, to more models. Indeed, this was the mindset with which we initially approached this project. If the diversity hypothesis is correct, then this may become easier as we train our models to perform more complex tasks. Like early biologists encountering a new species, there may be a lot we can glean from taking a magnifying glass to the creatures in front of us.
Our interpretability tools are available as lucid.scratch.rl_util, a submodule of Lucid, and we demonstrate them in an accompanying notebook.

Here we explain our method for editing the model to make the agent blind to certain features.
The features in our interface correspond to directions in activation space obtained by applying attribution-based NMF to layer 2b of our model. To blind the agent to a feature, we edit the weights to make them project out the corresponding NMF direction.
More precisely, let $v$ be the NMF direction corresponding to the feature we wish to blind the model to. This is a vector of length $c$, the number of channels in activation space. Using this we construct the orthogonal projection matrix $P = I - \frac{v v^\top}{v^\top v}$, which projects out the direction of $v$ from activation vectors. We then take the convolutional kernel of the following layer, which has shape $h \times w \times c \times c'$, where $c'$ is the number of output channels. Broadcasting across the height and width dimensions, we left-multiply each $c \times c'$ matrix in the kernel by $P$. The effect of the new kernel is to project out the direction of $v$ from activations before applying the original kernel.
As it turned out, the NMF directions were close to one-hot, so this procedure is approximately equivalent to zeroing out the slice of the kernel corresponding to a particular in-channel.
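A minimal numpy sketch of this edit (helper and variable names are ours, not from lucid.scratch.rl_util):

```python
import numpy as np

def blind_to_direction(kernel, v):
    """Edit the next layer's convolution kernel so that the direction v is
    projected out of incoming activations before the original kernel is applied.

    kernel: (h, w, c, c_out) convolution weights of the following layer.
    v:      (c,) NMF direction to blind the model to.
    """
    P = np.eye(len(v)) - np.outer(v, v) / np.dot(v, v)   # orthogonal projection
    # Broadcasting over the height and width dimensions, left-multiply each
    # (c, c_out) matrix in the kernel by P.
    return np.einsum("ij,hwjo->hwio", P, kernel)
```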
Here we explain the application of integrated gradients to computing attribution from a hidden layer to the value function, as used throughout this article.
Let $V$ be the value function computed by our network, which accepts a 64x64 RGB observation. Given any layer in the network, we may write $V = g \circ f$, where $f$ computes the layer's activations and $g$ computes the value function from those activations. Given an observation $x$ with activations $a = f(x)$, a simple method of attribution is to compute $\nabla g(a) \odot a$, where $\odot$ denotes the pointwise product. This tells us the sensitivity of the value function to each activation, multiplied by the strength of that activation. However, it uses the sensitivity of the value function at the activation itself, which does not account for the fact that this sensitivity may change as the activation is increased from zero.
To account for this, the integrated gradients method instead chooses a path $\gamma$ in activation space from some starting point $a_0$ to the ending point $a = f(x)$. We then compute the integrated gradient of $g$ along $\gamma$, which is defined as the path integral
$$\int_\gamma \nabla g(z) \odot \mathrm{d}z.$$
Note the use of the pointwise product rather than the usual dot product here, which makes the integral vector-valued. By the fundamental theorem of calculus for line integrals, when the components of the vector produced by this integral are summed, the result depends only on the endpoints $a_0$ and $a$, equaling $g(a) - g(a_0)$. Thus the components of this vector provide a true decomposition of this difference, "attributing" it across the activations.
For our purposes, we take $\gamma$ to be the straight line from $a_0 = 0$ to $a = f(x)$, so the integrated gradient is
$$\int_0^1 \nabla g(\alpha a) \odot a \,\mathrm{d}\alpha.$$
This has the same dimensions as $a$, and its components sum to $g(a) - g(0) = V(x) - g(0)$. So for a convolutional layer, this method allows us to attribute the value function (in excess of the baseline $g(0)$) across the horizontal, vertical and channel dimensions of activation space. Positive value function attribution can be thought of as "good news": components that cause the agent to think it is more likely to collect the coin at the end of the level. Similarly, negative value function attribution can be thought of as "bad news".
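A minimal numpy sketch of this computation, approximating the integral with a Riemann sum; grad_g here is a hypothetical function returning $\nabla g$ at a given activation (e.g. obtained from the framework's autodiff):

```python
import numpy as np

def integrated_gradients(grad_g, a, steps=50):
    """Approximate the integrated gradient of g along the straight line from
    0 to the activations a, using a midpoint Riemann sum.

    grad_g(z) must return the gradient of the value function with respect to
    the activations, with the same shape as z. The components of the result
    sum (approximately) to g(a) - g(0).
    """
    alphas = (np.arange(steps) + 0.5) / steps   # midpoints along the path
    grad_sum = np.zeros_like(a)
    for alpha in alphas:
        grad_sum += grad_g(alpha * a)
    return grad_sum * a / steps                 # pointwise product with a
```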
Our architecture consists of 5 convolutional layers followed by fully-connected layers, with ReLU activations for all except the final layer.
We designed this architecture by starting with the architecture from IMPALA and making a number of modifications.
The choice that seemed to make the most difference was using 5 rather than 12 convolutional layers, resulting in the object-identifying features (which were the most interpretable, as discussed above) being concentrated in a single layer (layer 2b), rather than being spread over multiple layers and mixed in with less interpretable features.
We would like to thank our reviewers Jonathan Uesato, Joel Lehman and one anonymous reviewer for their detailed and thoughtful feedback. We would also like to thank Karl Cobbe, Daniel Filan, Sam Greydanus, Christopher Hesse, Jacob Jackson, Michael Littman, Ben Millwood, Konstantinos Mitsopoulos, Mira Murati, Jorge Orbay, Alex Ray, Ludwig Schubert, John Schulman, Ilya Sutskever, Nevan Wichers, Liang Zhang and Daniel Ziegler for research discussions, feedback, follow-up work, help and support that have greatly benefited this project.
Jacob Hilton was the primary contributor.
Nick Cammarata developed the model editing methodology and suggested applying it to CoinRun models.
Shan Carter (while working at OpenAI) advised on interface design throughout the project, and worked on many of the diagrams in the article.
Gabriel Goh provided evaluations of feature interpretability for the section Interpretability and generalization.
Chris Olah guided the direction of the project, performing initial exploratory research on the models, coming up with many of the research ideas, and helping to construct the article’s narrative.
Review 1 - Anonymous
Review 2 - Jonathan Uesato
Review 3 - Joel Lehman
If you see mistakes or want to suggest changes, please create an issue on GitHub.
Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise. The figures that have been reused from other sources don’t fall under this license and can be recognized by a note in their caption: “Figure from …”.
For attribution in academic contexts, please cite this work as
Hilton, et al., "Understanding RL Vision", Distill, 2020.
BibTeX citation
@article{hilton2020understanding,
  author = {Hilton, Jacob and Cammarata, Nick and Carter, Shan and Goh, Gabriel and Olah, Chris},
  title = {Understanding RL Vision},
  journal = {Distill},
  year = {2020},
  note = {https://distill.pub/2020/understanding-rl-vision},
  doi = {10.23915/distill.00029}
}