Keywords :
machine learning; random forest; variable importances; random subspace; feature selection
Abstract :
[en] Dealing with datasets of very high dimension is a major challenge in machine learning. In this paper, we consider the problem of feature selection in applications where memory is not large enough to contain all features. In this setting, we propose a novel tree-based feature selection approach that builds a sequence of randomized trees on small subsamples of variables, mixing variables already identified as relevant by previous models with variables randomly selected from the remaining ones. As our main contribution, we provide an in-depth theoretical analysis of this method in the infinite sample setting. In particular, we study its soundness with respect to common definitions of feature relevance and its convergence speed under various variable dependence scenarios. We also provide preliminary empirical results highlighting the potential of the approach.
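The iterative procedure described in the abstract can be sketched in a few lines. Below is a minimal, illustrative Python sketch that uses scikit-learn's ExtraTreesClassifier as a stand-in for the paper's randomized trees; the memory budget q, the mixing ratio alpha, the iteration count n_iter, and the importance threshold are hypothetical parameters chosen for illustration, not the paper's exact formulation.

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

def sequential_random_subspace(X, y, q=20, alpha=0.5, n_iter=50, seed=0):
    """Iteratively grow a set of relevant features using randomized trees
    built on small variable subsamples that mix features already found
    relevant with freshly sampled ones. All parameters are illustrative."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    relevant = set()
    for _ in range(n_iter):
        # Re-use up to alpha*q features already identified as relevant ...
        if relevant:
            keep = rng.choice(sorted(relevant),
                              size=min(len(relevant), int(alpha * q)),
                              replace=False)
        else:
            keep = np.array([], dtype=int)
        # ... and fill the rest of the memory budget with random others.
        others = np.setdiff1d(np.arange(n_features), keep)
        fresh = rng.choice(others, size=q - len(keep), replace=False)
        subset = np.concatenate([keep, fresh]).astype(int)
        # Fit randomized trees on the q sampled columns only, so that each
        # model touches just a small, memory-friendly slice of the data.
        model = ExtraTreesClassifier(n_estimators=50, random_state=seed)
        model.fit(X[:, subset], y)
        # Flag features whose importance exceeds an (assumed) small threshold.
        for j, imp in zip(subset, model.feature_importances_):
            if imp > 1.0 / (10 * q):
                relevant.add(int(j))
    return sorted(relevant)

Because impurity-based importances of irrelevant features tend toward zero, the relevant set accumulates across iterations while each individual model only ever loads q feature columns into memory.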
Disciplines :
Electrical & electronics engineering
Author, co-author :
Sutera, Antonio ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Algorith. des syst. en interaction avec le monde physique
Châtel, Célia
Louppe, Gilles ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Big Data
Wehenkel, Louis ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation
Geurts, Pierre ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Algorith. des syst. en interaction avec le monde physique
Language :
English
Title :
Random Subspace with Trees for Feature Selection Under Memory Constraints
Publication date :
2018
Event name :
The 21st International Conference on Artificial Intelligence and Statistics
Event place :
Playa Blanca, Spain
Event date :
April 9 to 11, 2018
Audience :
International
Main work title :
Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics
Editor :
Storkey, Amos
Perez-Cruz, Fernando
Publisher :
PMLR, Playa Blanca, Spain
Collection name :
Proceedings of Machine Learning Research
Pages :
929-937
Peer reviewed :
Peer reviewed
Tags :
CÉCI : Consortium des Équipements de Calcul Intensif