
We show that at least 23.5% (out of 88%) of the accuracy can be explained by robust features in $\hat{\mathcal{D}}_{\text{rand}}$. This is a weak lower bound, established with a linear model, and it does not preclude the possibility of further leakage. On the other hand, we find no evidence of leakage in $\hat{\mathcal{D}}_{\text{det}}$.

Our technique for quantifying leakage consists of two steps:

- First, we construct features $f_i(x) = w_i^Tx$ that are provably robust, in a sense we will soon specify.
- Next, we train a linear classifier as per Equation 3 of Ilyas et al. on the datasets $\hat{\mathcal{D}}_{\text{det}}$ and $\hat{\mathcal{D}}_{\text{rand}}$ (defined in Table 1 of Ilyas et al.), using these robust features *only*; a minimal sketch of this pipeline follows.
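To make the procedure concrete, here is a minimal sketch of the two-step pipeline. It is an illustration under stated assumptions rather than the original experiment: `W` stands for a hypothetical (10, 3072) weight matrix of the robust linear model, and `X_rand`, `y_rand`, `X_test`, `y_test` for a flattened $\hat{\mathcal{D}}_{\text{rand}}$ training set and the CIFAR-10 test set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical inputs (assumptions for illustration, not artifacts of the original study):
#   W              : (10, 3072) weight matrix of a robust linear model trained on CIFAR-10
#   X_rand, y_rand : flattened images and labels of the D_rand training set
#   X_test, y_test : the standard CIFAR-10 test set

def robust_features(X, W):
    """Step 1: evaluate the 10 provably robust linear features f_i(x) = w_i^T x."""
    return X @ W.T                                   # shape (n_samples, 10)

def leakage_lower_bound(W, X_train, y_train, X_test, y_test):
    """Step 2: train a linear classifier on the robust features only and
    measure its accuracy against the true CIFAR-10 test labels."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(robust_features(X_train, W), y_train)
    return clf.score(robust_features(X_test, W), y_test)

# Any above-chance accuracy obtained this way is carried entirely by robust features,
# so it lower-bounds the robust feature leakage in the training set.
```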

Since Ilyas et al. do not explicitly define a *robust feature* in the multiclass setting, we consider two possible specifications:

- For at least one of the classes, the feature is $\gamma$-robustly useful in the sense of Ilyas et al. (restated below).
- The feature comes from a robust model for which at least 80% of points in the test set have predictions that remain static within an $L_2$ ball of radius 0.25.
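For reference, the notion the first specification builds on is Ilyas et al.'s definition of a $\gamma$-robustly useful feature, restated here (up to notation) for the binary case: a feature $f$ remains useful even under worst-case perturbations,

$$\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\,\inf_{\delta\in\Delta(x)} y\cdot f(x+\delta)\right] \geq \gamma,$$

where $y\in\{\pm1\}$; the first specification applies this one-vs-all, treating membership in the chosen class as the binary label.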

We find features that satisfy *both* specifications by using the 10 linear features of a robust linear model trained on CIFAR-10. Because the features are linear, the above two conditions can be certified analytically. We leave it to the reader to inspect the weights corresponding to the features.
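Because each feature is a linear function of the input, the robustness check reduces to a closed-form margin condition: a multiclass linear model's prediction at $x$ cannot change within an $L_2$ ball of radius $\epsilon$ if, for the predicted class $c$ and every other class $j$, the logit margin exceeds $\epsilon\,\lVert w_c - w_j \rVert_2$ (by Cauchy–Schwarz). The sketch below, assuming the model's weights `W` and biases `b` are available as NumPy arrays, checks the second specification this way.

```python
import numpy as np

def fraction_certifiably_static(W, b, X, eps=0.25):
    """Fraction of points whose predicted class provably cannot change under
    any L2 perturbation of norm <= eps, using the linear (Cauchy-Schwarz) bound."""
    logits = X @ W.T + b                              # (n, 10) class scores
    pred = logits.argmax(axis=1)
    certified = np.ones(len(X), dtype=bool)
    for j in range(W.shape[0]):
        margin = logits[np.arange(len(X)), pred] - logits[:, j]
        # largest possible swing of the (pred, j) logit gap inside the eps-ball
        worst_case = eps * np.linalg.norm(W[pred] - W[j], axis=1)
        certified &= (margin >= worst_case) | (pred == j)
    return certified.mean()

# The second specification asks that at least 80% of test points be certifiably static:
# fraction_certifiably_static(W, b, X_test) >= 0.8
```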

Training a linear model on $\hat{\mathcal{D}}_{\text{rand}}$ using the above robust features only and testing on the CIFAR-10 test set yields an accuracy of **23.5%** (out of 88%). Doing the same on $\hat{\mathcal{D}}_{\text{det}}$ yields an accuracy of **6.81%** (out of 44%).

The contrasting results suggest that the two experiments should be interpreted differently. The transfer results for $\hat{\mathcal{D}}_{\text{rand}}$ in Table 1 of Ilyas et al. should therefore be read with some caution, since part of that accuracy can be attributed to robust feature leakage. The results for $\hat{\mathcal{D}}_{\text{det}}$ in Table 1 of Ilyas et al., on the other hand, show no evidence of such leakage.

To cite Ilyas et al.’s response, please cite their
collection of responses.

**Response**: This comment raises a valid concern which was in fact one of
the primary reasons for designing the $\widehat{\mathcal{D}}_{det}$ dataset.
In particular, recall the construction of the $\widehat{\mathcal{D}}_{rand}$
dataset: assign each input a random target label and do PGD towards that label.
Note that unlike the $\widehat{\mathcal{D}}_{det}$ dataset (in which the
target class is deterministically chosen), the $\widehat{\mathcal{D}}_{rand}$
dataset allows for robust features to actually have a (small) positive
correlation with the label.

To see how this can happen, consider the following simple setting: we have a single feature $f(x)$ that is $1$ for cats and $-1$ for dogs. If $\epsilon = 0.1$, then $f(x)$ is certainly a robust feature. However, randomly assigning labels (as in the dataset $\widehat{\mathcal{D}}_{rand}$) would make this feature uncorrelated with the assigned label, i.e., we would have that $\mathbb{E}[f(x)\cdot y] = 0$. Performing a targeted attack might in this case induce some correlation with the assigned label, as we could have $\mathbb{E}[f(x+\eta\cdot\nabla f(x))\cdot y] > \mathbb{E}[f(x)\cdot y] = 0$, allowing a model to learn to correctly classify new inputs.
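A toy numerical sketch of this mechanism (synthetic data for illustration only, not from the experiments): take a linear feature $f(x) = w^\top x$ whose sign is essentially robust to $\epsilon$-perturbations, assign each point a random label $y \in \{\pm 1\}$, and nudge each input by a step of size $\eta < \epsilon$ toward its assigned label. The feature stays robust, yet a small positive correlation with the random labels appears.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps, eta = 100_000, 100, 0.1, 0.05   # perturbation step eta stays within the budget eps

w = rng.normal(size=d)
w /= np.linalg.norm(w)                     # unit-norm linear feature f(x) = w.x

X = 3 * rng.normal(size=(n, d))            # |f(x)| is typically much larger than eps,
                                           # so the sign of f is (almost always) robust
y = rng.choice([-1.0, 1.0], size=n)        # labels assigned uniformly at random

corr_before = np.mean((X @ w) * y)         # ~ 0: f is uncorrelated with the random labels

X_adv = X + eta * y[:, None] * w           # small targeted step toward the assigned label
corr_after = np.mean((X_adv @ w) * y)      # ~ eta > 0: positive correlation has been encoded

print(f"E[f(x) y] = {corr_before:.3f},  E[f(x_adv) y] = {corr_after:.3f}")
```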

In other words, starting from a dataset with no features, one can encode
robust features within small perturbations. In contrast, in the
$\widehat{\mathcal{D}}_{det}$ dataset, the robust features are *correlated
with the original label* (since the labels are permuted) and since they are
robust, they cannot be flipped to correlate with the newly assigned (wrong)
label. Still, the $\widehat{\mathcal{D}}_{rand}$ dataset enables us to show
that (a) PGD-based adversarial examples actually alter features in the data and
(b) models can learn from human-meaningless/mislabeled training data. The
$\widehat{\mathcal{D}}_{det}$ dataset, on the other hand, illustrates that the
non-robust features are actually sufficient for generalization and can be
preferred over robust ones in natural settings.

The experiment put forth in the comment is a clever way of showing that such
leakage is indeed possible. However, we want to stress (as the comment itself
does) that robust feature leakage does *not* have an impact on our main
thesis — the $\widehat{\mathcal{D}}_{det}$ dataset explicitly controls
for robust
feature leakage (and in fact, allows us to quantify the models’ preference for
robust features vs non-robust features — see Appendix D.6 in the
paper).

You can find more responses in the main discussion article.

Acknowledgments: Shan Carter (started the project), Preetum (technical discussion), Chris Olah (technical discussion), Ria (technical discussion), Aditya (feedback)

- Adversarial examples are not bugs, they are features. Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran, B. and Madry, A., 2019. arXiv preprint arXiv:1905.02175.


Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise. The figures that have been reused from other sources don’t fall under this license and can be recognized by a note in their caption: “Figure from …”.

For attribution in academic contexts, please cite this work as

Goh, "A Discussion of 'Adversarial Examples Are Not Bugs, They Are Features': Robust Feature Leakage", Distill, 2019.

BibTeX citation

@article{goh2019a,
  author = {Goh, Gabriel},
  title = {A Discussion of 'Adversarial Examples Are Not Bugs, They Are Features': Robust Feature Leakage},
  journal = {Distill},
  year = {2019},
  note = {https://distill.pub/2019/advex-bugs-discussion/response-2},
  doi = {10.23915/distill.00019.2}
}

This article is part of a discussion of the Ilyas et al. paper "Adversarial examples are not bugs, they are features". You can learn more in the main discussion article.