Inspecting gradient magnitudes in context can be a powerful tool to see when recurrent units use short-term or long-term contextual understanding.
Memorization in Recurrent Neural Networks (RNNs) continues to pose a challenge in many applications. We’d like RNNs to be able to store information over many timesteps and retrieve it when it becomes relevant — but vanilla RNNs often struggle to do this.
Several network architectures have been proposed to tackle aspects of this problem, such as Long Short-Term Memory (LSTM) units and Gated Recurrent Units (GRU). To compare a recurrent unit against its alternatives, both past and recent papers, such as the Nested LSTM paper by Moniz et al., typically rely on quantitative comparisons such as accuracy and cross entropy loss.
While quantitative comparisons are useful, they only provide partial insight into how a recurrent unit memorizes. A model can, for example, achieve high accuracy and low cross entropy loss by providing highly accurate predictions in cases that only require short-term memorization, while being inaccurate at predictions that require long-term memorization. For example, when autocompleting words in a sentence, a model with only short-term understanding can still exhibit high accuracy completing the ends of words once most of the characters are present. However, without longer-term contextual understanding it will not be able to predict words when only a few characters are known.
This article presents a qualitative visualization method for comparing recurrent units with regards to memorization and contextual understanding. The method is applied to the three recurrent units mentioned above: Nested LSTMs, LSTMs, and GRUs.
The networks that will be analyzed all use a simple RNN structure, in which a recurrent unit maps the current input and the previous hidden state to a new hidden state.
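As a minimal sketch of this structure, for a single recurrent layer (the exact parameterization of the unit is precisely what differs between GRU, LSTM, and Nested LSTM):

$$
h_t = \mathrm{Unit}(x_t,\, h_{t-1}), \qquad y_t = \mathrm{Softmax}(W\, h_t + b)
$$

In the models analyzed here, two such layers are stacked, with an embedding layer before and a dense softmax layer after (see the appendix).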
In theory, this time dependency allows the network, in each iteration, to know about every part of the sequence that came before. However, the same time dependency typically causes a vanishing gradient problem, which results in long-term dependencies being ignored during training.
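The standard way to see why is via the chain rule: the gradient connecting a hidden state far in the past to a much later time step is a product of many Jacobians, sketched here for a generic recurrence $h_t = \mathrm{Unit}(x_t, h_{t-1})$:

$$
\frac{\partial h_T}{\partial h_1} = \prod_{t=2}^{T} \frac{\partial h_t}{\partial h_{t-1}}
$$

When the norms of these Jacobians are typically below one, the product shrinks exponentially with the distance $T - 1$, so inputs far in the past contribute almost nothing to the weight updates.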
Several solutions to the vanishing gradient problem have been proposed over the years. The most popular are the aforementioned LSTM and GRU units, but this is still an area of active research. Both LSTM and GRU are well known and thoroughly explained in the literature. Recently, Nested LSTMs have also been proposed.
It is not entirely clear why one recurrent unit performs better than another in some applications, while in others a different type of recurrent unit performs better. Theoretically they all solve the vanishing gradient problem, but in practice their performance is highly application dependent.
Understanding why these differences occur is likely an opaque and challenging problem. The purpose of this article is to demonstrate a visualization technique that better highlights what these differences are. Hopefully, such insight can lead to a deeper understanding of the models themselves.
Comparing different recurrent units is often more involved than simply comparing accuracy or cross entropy loss. Differences in these high-level quantitative measures can have many explanations and may only reflect some small improvement in predictions that require short-term contextual understanding, while it is often the long-term contextual understanding that is of interest.
Therefore, a good problem for qualitatively analyzing contextual understanding should have a human-interpretable output and depend on both long-term and short-term contextual understanding. The typical problems that are often used, such as Penn Treebank, do not satisfy these criteria.
To this end, this article studies the autocomplete problem. Each character is mapped to a target that represents the entire word. The space leading up to the word should also map to that target. This prediction based on the space character is particularly useful for showing contextual understanding.
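As an illustration, a minimal sketch of this target construction could look as follows (the function and variable names are chosen here for illustration and are not taken from the released implementation):

```python
def autocomplete_targets(text):
    """Label every character, including the space leading up to a word,
    with the word it belongs to."""
    pairs = []
    for word in text.split(' '):
        # For simplicity a space is prepended to every word, including the first.
        for char in ' ' + word:
            pairs.append((char, word))
    return pairs

# autocomplete_targets('the context')
# -> [(' ', 'the'), ('t', 'the'), ('h', 'the'), ('e', 'the'),
#     (' ', 'context'), ('c', 'context'), ('o', 'context'), ...]
```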
The autocomplete problem is quite similar to the text8 generation problem: the only difference is that instead of predicting the next letter, the model predicts an entire word. This makes the output much more interpretable. Finally, because of its close relation to text8 generation, existing literature on text8 generation is relevant and comparable, in the sense that models that work well on text8 generation should work well on the autocomplete problem.
The autocomplete dataset is constructed from the full text8 dataset. The recurrent neural networks used to solve the problem have two layers, each with 600 units. There are three models, using GRU, LSTM, and Nested LSTM. See the appendix for more details.
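As a rough sketch of such a model, here is a Keras version of the GRU variant (the vocabulary sizes and the embedding dimension are assumptions on my part, and the released implementation may be structured differently):

```python
import tensorflow as tf
from tensorflow.keras import layers

input_vocab = 27      # a-z and space (assumed; padding handled via masking)
output_vocab = 16386  # word-level targets (assumed)

model = tf.keras.Sequential([
    layers.Embedding(input_vocab, 600),      # character embedding
    layers.GRU(600, return_sequences=True),  # first recurrent layer
    layers.GRU(600, return_sequences=True),  # second recurrent layer
    layers.Dense(output_vocab),              # logits over the word vocabulary
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)
```

The LSTM variant simply swaps `layers.GRU` for `layers.LSTM`; Keras has no built-in Nested LSTM layer, so that variant requires a custom recurrent cell.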
In the recently published Nested LSTM paper, the Nested LSTM unit was, among other things, analyzed qualitatively by visualizing individual cell activations. This visualization was inspired by Karpathy et al., who used cell activations to identify cells with interpretable behavior. However, such activation plots only show what individual cells respond to; they do not show which parts of the input the model actually uses when making a prediction.
Instead, to get a better idea of how well each model memorizes and uses memory for contextual understanding, the connectivity between the desired output and the input is analyzed: how strongly the prediction at a given time step depends, in terms of gradient magnitude, on the input at each earlier time step.
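Written out, assuming the output of interest is the logit of the correct word $y_t$ and the input at time step $t'$ is the character embedding $x_{t'}$ (the exact choice of output and norm is a sketch on my part):

$$
\mathrm{connectivity}(t, t') = \left\| \frac{\partial\, \mathrm{logit}_{y_t}}{\partial\, x_{t'}} \right\|_2
$$

A large value means that the input at time step $t'$ strongly influenced the prediction at time step $t$; plotting this quantity over $t'$ for a fixed $t$ therefore shows which parts of the past the model relies on.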
Exploring the connectivity gives a surprising amount of insight into the different models’ ability for long-term contextual understanding. Try and interact with the figure below yourself to see what information the different models use for their predictions.
Let’s highlight three specific situations:
These observations show that the connectivity visualization is a powerful tool for comparing models in terms of which previous inputs they use for contextual understanding. However, it is only possible to compare models on the same dataset, and on a specific example. As such, while these observations may show that the Nested LSTM is not very capable of long-term contextual understanding in this example, they may not generalize to other datasets or hyperparameters.
From the above observations it appears that short-term contextual understanding often involves the word that is being predicted itself. That is, the models switch to using previously seen letters from the word itself, as more letters become available. In contrast, at the beginning of predicting a word, models — especially the GRU network — use previously seen words as context for the prediction.
This observation suggests a quantitative metric: measure the accuracy given how many letters of the word being predicted are already known. It is not clear that this is the best quantitative metric: it is highly problem dependent, and it does not summarize the model to a single number, which one may want for a more direct comparison.
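A sketch of such a measurement, assuming per-character arrays of predicted words, target words, and the number of letters already observed (all names here are illustrative), could look like:

```python
from collections import defaultdict

def accuracy_by_known_letters(predictions, targets, known_letters):
    """Group predictions by how many letters of the target word were already
    visible, and compute the accuracy within each group."""
    correct, total = defaultdict(int), defaultdict(int)
    for pred, target, known in zip(predictions, targets, known_letters):
        total[known] += 1
        correct[known] += int(pred == target)
    return {known: correct[known] / total[known] for known in sorted(total)}
```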
These results suggest that the GRU model is better at long-term contextual understanding, while the LSTM model is better at short-term contextual understanding. These observations are valuable, as they explain why the overall accuracy of the GRU and LSTM models is almost identical, even though the connectivity visualization shows that the GRU model is far better at long-term contextual understanding.
While more detailed quantitative metrics like this provide new insight, qualitative analyses like the connectivity figure presented in this article still have great value: the connectivity visualization gives an intuitive understanding of how the model works, which a quantitative metric cannot. It also shows that a wrong prediction can still be a useful prediction, such as a synonym or a contextually reasonable word.
Looking at overall accuracy and cross entropy loss by themselves is not that informative. Different models may prioritize either long-term or short-term contextual understanding, yet have similar accuracy and cross entropy loss.
A qualitative analysis, where one looks at how previous input is used in the prediction, is therefore also important when judging models. In this case, the connectivity visualization, together with the autocomplete predictions, reveals that the GRU model is much more capable of long-term contextual understanding than the LSTM and Nested LSTM models. In the case of LSTM, the difference is much larger than one would guess from looking at the overall accuracy and cross entropy loss alone. This observation is not that interesting in itself, as it is likely highly dependent on the hyperparameters and the specific application.
Much more valuable is that this visualization method makes it possible to intuitively understand how the models differ, to a much higher degree than accuracy and cross entropy allow. For this application, it is clear that the GRU model uses repeating words and the semantic meaning of past words to make its predictions, to a much higher degree than the LSTM and Nested LSTM models. This is both a valuable insight when choosing the final model and essential knowledge when developing better models in the future.
Many thanks to the authors of the original Nested LSTM paper, Joel Ruben Antony Moniz and David Krueger.
I am also grateful for the excellent feedback and patience from the Distill team, especially Christopher Olah and Ludwig Schubert, as well as the feedback from the peer-reviewers. Their feedback has dramatically improved the quality of this article.
Review 1 - Abhinav Sharma
Review 2 - Dylan Cashman
Review 3 - Ruth Fong
The Nested LSTM unit attempts to solve the long-term memorization problem from a more practical point of view. Where the standard LSTM unit solves the vanishing gradient problem by adding internal memory, and the GRU attempts to be a faster solution than LSTM by using no internal memory, the Nested LSTM goes in the opposite direction of the GRU by adding additional internal memory to the unit.
The additional memory is integrated into the LSTM unit by changing how the cell value $c_t$ is computed. In a vanilla LSTM, the cell value is updated as $c_t = f_t \odot c_{t-1} + i_t \odot g_t$; in a Nested LSTM, this sum is replaced by an internal memory function, which is itself implemented as an LSTM cell.
The complete set of equations then becomes:
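(The notation below is a reconstruction based on standard LSTM conventions and the Nested LSTM paper, not the article's original figure; $\odot$ denotes element-wise multiplication.)

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) \\
c_t &= \mathrm{MEMORY}\left(f_t \odot c_{t-1},\; i_t \odot g_t\right) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

Setting $\mathrm{MEMORY}(a, b) = a + b$ recovers the vanilla LSTM; the Nested LSTM instead implements $\mathrm{MEMORY}$ as another LSTM cell with its own internal cell state.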
Like in vanilla LSTM, the gate activation functions $\sigma$ are sigmoid functions.
The abstraction of how to combine the input with the cell value allows for a lot of flexibility. Using this abstraction, it is not only possible to add one extra internal memory state; the internal memory function can itself be nested, adding as many levels of internal memory as desired.
The equations defining the internal memory function mirror those of a vanilla LSTM cell, with $f_t \odot c_{t-1}$ taking the role of the previous hidden state and $i_t \odot g_t$ taking the role of the input. The gate activation functions of the internal cell are likewise sigmoid functions.
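To make the nesting concrete, here is a minimal NumPy sketch of a single Nested LSTM step under the assumptions above (the weight layout and the use of $\tanh$ are choices made for this sketch, not taken from the article):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_gates(x, h, W, U, b):
    """Input, forget, and output gates plus the candidate for an LSTM-style cell.
    W maps the input and U the hidden state to 4 * units pre-activations."""
    i, f, o, g = np.split(W @ x + U @ h + b, 4)
    return sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)

def nested_lstm_step(x, h, c, c_inner, outer, inner):
    """One time step of a Nested LSTM (a sketch, not the reference implementation).

    Instead of the vanilla update c = f * c_prev + i * g, the two terms are fed
    to an inner LSTM cell as its previous hidden state and its input."""
    i, f, o, g = lstm_gates(x, h, *outer)

    inner_h_prev = f * c   # plays the role of the inner cell's previous hidden state
    inner_x = i * g        # plays the role of the inner cell's input
    ii, fi, oi, gi = lstm_gates(inner_x, inner_h_prev, *inner)
    c_inner = fi * c_inner + ii * gi
    c = oi * np.tanh(c_inner)  # the inner hidden state becomes the outer cell value

    h = o * np.tanh(c)
    return h, c, c_inner
```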
The autocomplete dataset is constructed from the full text8 dataset, where each observation consists of at most 200 characters and is ensured not to contain partial words. 90% of the observations are used for training, 5% for validation, and 5% for testing.
The input vocabulary is a-z, space, and a padding symbol. The output vocabulary consists of the $2^{14}$ most frequent words, plus special symbols for unknown words and padding.
The GRU and LSTM models each have 2 layers of 600 units. Similarly, the Nested LSTM model has 1 layer of 600 units, but with 2 internal memory states. Additionally, each model has an input embedding layer and a final dense layer to match the vocabulary size.
| Model | Units | Layers | Depth | Embedding parameters | Recurrent parameters | Dense parameters |
|---|---|---|---|---|---|---|
| GRU | 600 | 2 | N/A | 16,200 | 4,323,600 | 9,847,986 |
| LSTM | 600 | 2 | N/A | 16,200 | 5,764,800 | 9,847,986 |
| Nested LSTM | 600 | 1 | 2 | 16,200 | 5,764,800 | 9,847,986 |
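The parameter counts in the table can be reproduced with a short calculation, assuming a 600-dimensional embedding, one bias vector per gate, and the vocabulary sizes below (these are assumptions on my part that happen to be consistent with the table):

```python
units, emb = 600, 600
in_vocab, out_vocab = 27, 16386  # assumed input/output vocabulary sizes

embedding = in_vocab * emb                               # 16,200
gru_layer = 3 * (emb * units + units * units + units)    # 3 gate/candidate blocks
lstm_layer = 4 * (emb * units + units * units + units)   # 4 gate/candidate blocks
dense = units * out_vocab + out_vocab                    # 9,847,986

print(embedding, 2 * gru_layer, 2 * lstm_layer, dense)
# 16200 4323600 5764800 9847986
```

Note that the Nested LSTM row (1 layer, depth 2) has the same recurrent parameter count as two LSTM layers, since its internal memory function is itself an LSTM cell operating on 600-dimensional vectors.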
There are 456,896 sequences in the training dataset, and a mini-batch size of 64 observations is used. A single iteration over the entire dataset thus corresponds to 7,139 mini-batches. The training runs over the dataset twice, corresponding to 14,278 mini-batches. For training, Adam optimization is used with default parameters.
Evaluating the models on the test dataset yields the following cross entropy losses and accuracies.
| Model | Cross Entropy | Accuracy |
|---|---|---|
| GRU | 2.1170 | 52.01% |
| LSTM | 2.1713 | 51.40% |
| Nested LSTM | 2.4950 | 47.10% |
The implementation is available at https://github.com/distillpub/post--memorization-in-rnns.
If you see mistakes or want to suggest changes, please create an issue on GitHub.
Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise. The figures that have been reused from other sources don’t fall under this license and can be recognized by a note in their caption: “Figure from …”.
For attribution in academic contexts, please cite this work as
Madsen, "Visualizing memorization in RNNs", Distill, 2019.
BibTeX citation
@article{madsen2019visualizing,
  author = {Madsen, Andreas},
  title = {Visualizing memorization in RNNs},
  journal = {Distill},
  year = {2019},
  note = {https://distill.pub/2019/memorization-in-rnns},
  doi = {10.23915/distill.00016}
}