Keywords :
machine learning; random forest; variable importances; random subspace; feature selection
Abstract :
[en] Dealing with datasets of very high dimension is a major challenge in machine learning. In this paper, we consider the problem of feature selection in applications where memory is not large enough to contain all features. In this setting, we propose a novel tree-based feature selection approach that builds a sequence of randomized trees on small subsamples of variables, mixing variables already identified as relevant by previous models with variables randomly selected from the remaining ones. As our main contribution, we provide an in-depth theoretical analysis of this method in the infinite sample setting. In particular, we study its soundness with respect to common definitions of feature relevance and its convergence speed under various variable dependence scenarios. We also provide preliminary empirical results highlighting the potential of the approach.
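The iterative procedure described in the abstract can be sketched in a few lines. Below is a minimal, illustrative Python sketch that uses scikit-learn's ExtraTreesClassifier as a stand-in for the paper's randomized trees; the memory budget q, the mixing ratio alpha, the iteration count n_iter, and the importance threshold are hypothetical parameters chosen for illustration, not the paper's exact formulation.

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

def sequential_random_subspace(X, y, q=20, alpha=0.5, n_iter=50, seed=0):
    """Iteratively grow a set of relevant features using randomized trees
    built on small variable subsamples that mix features already found
    relevant with freshly sampled ones. All parameters are illustrative."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    relevant = set()
    for _ in range(n_iter):
        # Re-use up to alpha*q features already identified as relevant ...
        if relevant:
            keep = rng.choice(sorted(relevant),
                              size=min(len(relevant), int(alpha * q)),
                              replace=False)
        else:
            keep = np.array([], dtype=int)
        # ... and fill the rest of the memory budget with random others.
        others = np.setdiff1d(np.arange(n_features), keep)
        fresh = rng.choice(others, size=q - len(keep), replace=False)
        subset = np.concatenate([keep, fresh]).astype(int)
        # Fit randomized trees on the q sampled columns only, so that each
        # model touches just a small, memory-friendly slice of the data.
        model = ExtraTreesClassifier(n_estimators=50, random_state=seed)
        model.fit(X[:, subset], y)
        # Flag features whose importance exceeds an (assumed) small threshold.
        for j, imp in zip(subset, model.feature_importances_):
            if imp > 1.0 / (10 * q):
                relevant.add(int(j))
    return sorted(relevant)

Because impurity-based importances of irrelevant features tend toward zero, the relevant set accumulates across iterations while each individual model only ever loads q feature columns into memory.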
Disciplines :
Electrical & electronics engineering
Author, co-author :
Sutera, Antonio ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Algorith. des syst. en interaction avec le monde physique
Châtel, Célia
Louppe, Gilles ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Big Data
Wehenkel, Louis ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation
Geurts, Pierre ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Algorith. des syst. en interaction avec le monde physique
Language :
English
Title :
Random Subspace with Trees for Feature Selection Under Memory Constraints
Publication date :
2018
Event name :
The 21st International Conference on Artificial Intelligence and Statistics
Event place :
Playa Blanca, Spain
Event date :
April 9 to 11, 2018
Audience :
International
Main work title :
Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics
Editor :
Storkey, Amos
Perez-Cruz, Fernando
Publisher :
PMLR, Playa Blanca, Spain
Collection name :
Proceedings of Machine Learning Research
Pages :
929-937
Peer reviewed :
Peer reviewed
Tags :
CÉCI : Consortium des Équipements de Calcul Intensif