Warming up recurrent neural networks to maximise reachable multistability greatly improves learning

Training recurrent neural networks is known to be difficult when time dependencies become long. In this work, we show that most standard cells have only one stable equilibrium at initialisation, and that learning on tasks with long time dependencies generally occurs once the number of stable equilibria of the network increases, a property known as multistability. Initially monostable networks often do not attain multistability easily, which makes learning long time dependencies between inputs and outputs difficult. This insight leads to a novel way to initialise the connectivity of any recurrent cell, through a procedure called “warmup”, to improve its capability to learn arbitrarily long time dependencies. This initialisation procedure is designed to maximise the reachable multistability of the network, i.e., the number of equilibria within the network that can be reached through relevant input trajectories, in few gradient steps. We show on several information restitution, sequence classification, and reinforcement learning benchmarks that warming up greatly improves learning speed and performance for multiple recurrent cells, but sometimes impedes precision. We therefore introduce a double-layer architecture initialised with a partial warmup, which is shown to greatly improve learning of long time dependencies while maintaining high levels of precision. This approach provides a general framework for improving the learning abilities of any recurrent cell when long time dependencies are present. We also show empirically that other initialisation and pretraining procedures from the literature implicitly foster reachable multistability of recurrent cells.
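The notion of reachable multistability described in the abstract can be illustrated with a small numerical sketch. The code below is not the authors' warmup procedure or their measurement method; it is a minimal illustration under stated assumptions: a plain tanh cell `h ← tanh(W h + U x)`, Gaussian random inputs standing in for "relevant input trajectories", and a hypothetical helper `count_reachable_equilibria` that drives the cell, lets it relax with zero input, and counts the distinct states it settles into.

```python
import numpy as np

def count_reachable_equilibria(W, U, n_traj=64, t_input=20, t_relax=500,
                               tol=1e-3, seed=0):
    """Hypothetical helper: count distinct states reached by a tanh cell
    after it is driven by random inputs and then left to relax."""
    rng = np.random.default_rng(seed)
    n_hidden, n_in = U.shape
    h = np.zeros((n_traj, n_hidden))
    # Phase 1: drive the cell with random input trajectories
    # (an assumption standing in for task-relevant inputs).
    for _ in range(t_input):
        x = rng.normal(size=(n_traj, n_in))
        h = np.tanh(h @ W.T + x @ U.T)
    # Phase 2: hold the input at zero so each hidden state can settle.
    for _ in range(t_relax):
        h = np.tanh(h @ W.T)
    # Phase 3: group final states closer than `tol` and count the groups.
    equilibria = []
    for state in h:
        if not any(np.linalg.norm(state - e) < tol for e in equilibria):
            equilibria.append(state)
    return len(equilibria)

U = np.eye(2)
# Weak recurrent weights: every trajectory decays to the single equilibrium at 0.
mono = count_reachable_equilibria(0.5 * np.eye(2), U)
# Strong self-excitation: each unit is bistable, so at most 4 equilibria exist.
multi = count_reachable_equilibria(2.0 * np.eye(2), U)
print(mono, multi)
```

In this toy setting the weakly connected cell forgets its inputs entirely once they stop, while the strongly self-excited cell retains the sign of each unit indefinitely, which is the kind of persistent memory that long time dependencies require.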

Disciplines :

Computer science

Author, co-author :

Lambrechts, Gaspard ^✱; Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science

De Geeter, Florent ^✱; Université de Liège - ULiège > Département d'électricité, électronique et informatique (Institut Montefiore) > Systèmes et modélisation

Vecoven, Nicolas ^✱

Ernst, Damien ; Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science

Drion, Guillaume ; Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science

^✱ These authors have contributed equally to this work.

Language :

English

Title :

Warming up recurrent neural networks to maximise reachable multistability greatly improves learning

Publication date :

August 2023

Journal title :

Neural Networks

ISSN :

0893-6080

eISSN :

1879-2782

Publisher :

Elsevier, United Kingdom

Volume :

166

Pages :

645-669

Peer reviewed :

Peer Reviewed verified by ORBi

Tags :

CÉCI : Consortium des Équipements de Calcul Intensif
Tier-1 supercomputer

Additional URL :

https://arxiv.org/abs/2106.01001
https://www.sciencedirect.com/science/article/pii/S0893608023003817

Available on ORBi :

since 03 June 2021

Statistics

Number of views

630 (127 by ULiège)

Number of downloads

250 (39 by ULiège)
