Recurrent neural network; Multistability; Initialisation procedure; Long-term memory; Warmup; Long time dependencies
Training recurrent neural networks is known to be difficult when time dependencies become long. In this work, we show that most standard cells have only one stable equilibrium at initialisation, and that learning on tasks with long time dependencies generally occurs once the number of stable equilibria of the network increases, a property known as multistability. Multistability is often not easily attained by initially monostable networks, which makes learning long time dependencies between inputs and outputs difficult. This insight leads to the design of a novel way to initialise any recurrent cell connectivity through a procedure called “warmup”, which improves the cell's capability to learn arbitrarily long time dependencies. This initialisation procedure is designed to maximise network reachable multistability, i.e., the number of equilibria within the network that can be reached through relevant input trajectories, in few gradient steps. We show on several information restitution, sequence classification, and reinforcement learning benchmarks that warming up greatly improves learning speed and performance for multiple recurrent cells, but sometimes impedes precision. We therefore introduce a double-layer architecture initialised with a partial warmup, which is shown to greatly improve learning of long time dependencies while maintaining high levels of precision. This approach provides a general framework for improving the learning abilities of any recurrent cell when long time dependencies are present. We also show empirically that other initialisation and pretraining procedures from the literature implicitly foster reachable multistability of recurrent cells.
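To make the notion of reachable multistability more concrete, the following is a minimal sketch of one way to estimate it numerically. It is not the authors' warmup procedure or their measure. It assumes a plain tanh recurrent cell, random Gaussian input trajectories as the "relevant" inputs, relaxation under zero input, and a distance tolerance for deciding when two final states are the same equilibrium. The function names `step` and `count_reachable_equilibria` and all default parameter values are hypothetical.

```python
import numpy as np


def step(h, x, W, U, b):
    """One update of a plain tanh recurrent cell (assumed cell, for illustration)."""
    return np.tanh(h @ W.T + x @ U.T + b)


def count_reachable_equilibria(W, U, b, n_traj=64, seq_len=20,
                               relax_steps=500, tol=1e-3, seed=0):
    """Estimate how many distinct resting states are reached from random inputs.

    Assumption: the autonomous dynamics settle to fixed points. Limit cycles or
    slow transients are not detected and would distort the count.
    """
    rng = np.random.default_rng(seed)
    n_hidden, n_in = U.shape
    h = np.zeros((n_traj, n_hidden))

    # 1) Drive the cell with random input trajectories to spread the hidden states.
    for _ in range(seq_len):
        x = rng.standard_normal((n_traj, n_in))
        h = step(h, x, W, U, b)

    # 2) Let each hidden state relax under zero input.
    x_zero = np.zeros((n_traj, n_in))
    for _ in range(relax_steps):
        h = step(h, x_zero, W, U, b)

    # 3) Greedily group final states that lie within `tol` of each other.
    representatives = []
    for state in h:
        if all(np.linalg.norm(state - r) > tol for r in representatives):
            representatives.append(state)
    return len(representatives)


# One-unit examples: a recurrent weight below 1 gives a single resting state,
# while a weight above 1 makes h = tanh(w * h) bistable.
U = np.array([[1.0]])
b = np.zeros(1)
mono = count_reachable_equilibria(np.array([[0.5]]), U, b)
bi = count_reachable_equilibria(np.array([[2.0]]), U, b)
print(mono, bi)  # prints "1 2"
```

In the one-unit case the count can be checked by hand: with recurrent weight 0.5 the map h ↦ tanh(0.5 h) contracts every state to 0, whereas with weight 2 the state settles near +0.96 or −0.96 depending on the sign it had when the inputs stopped.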
Author, co-author:
Lambrechts, Gaspard ✱; Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science
De Geeter, Florent ✱; Université de Liège - ULiège > Département d'électricité, électronique et informatique (Institut Montefiore) > Systèmes et modélisation
Vecoven, Nicolas ✱
Ernst, Damien ; Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science
Drion, Guillaume ; Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science
✱ These authors have contributed equally to this work.
Warming up recurrent neural networks to maximise reachable multistability greatly improves learning