The doctoral dissertations of the former Helsinki University of Technology (TKK) and Aalto University Schools of Technology (CHEM, ELEC, ENG, SCI) published in electronic format are available in the electronic publications archive of Aalto University - Aaltodoc.

Overfitting in Feature Selection: Pitfalls and Solutions

Juha Reunanen

Doctoral dissertation for the degree of Doctor of Science in Technology to be presented with due permission of the School of Science for public examination and debate in Auditorium T2 at the Aalto University School of Science (Espoo, Finland) on the 30th of March 2012 at 12 noon.

Overview in PDF format (ISBN 978-952-60-4516-0)   [2475 KB]
Dissertation is also available in print (ISBN 978-952-60-4515-3)

Abstract

When facing a typical pattern recognition task, one usually comes up with a number of so-called features: properties that describe the objects to be recognised. Based on these features, the task of the classifier-building algorithm is to find rules that are useful for recognising new objects.

Feature selection is a process where one tries to identify the useful features from among a potentially large set of candidates. The task is notoriously hard, and researchers have been tackling it for decades. Solving the problem properly may be more important today than ever before, because in many applications, dataset sizes seem to grow faster than the processing power of computers. For example, in the domain of genetic microarray data, there can easily be thousands of features.
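
One classic search strategy for this task is greedy sequential forward selection. The sketch below is a generic illustration (not code from this thesis); the `score` callback stands in for any subset-quality estimate, such as cross-validated accuracy, and the toy scoring function in the usage example is purely hypothetical:

```python
def forward_selection(candidates, score, max_size=None):
    """Greedy sequential forward selection.

    candidates: iterable of feature names
    score: callable mapping a feature subset (frozenset) to a number,
           higher is better (e.g. a cross-validated accuracy estimate)
    Returns the selected subset and its score.
    """
    remaining = list(candidates)
    selected = frozenset()
    best_score = score(selected)
    while remaining and (max_size is None or len(selected) < max_size):
        # Try adding each remaining feature and keep the best addition.
        gains = {f: score(selected | {f}) for f in remaining}
        f_best = max(gains, key=gains.get)
        if gains[f_best] <= best_score:
            break  # no single feature improves the score any further
        selected, best_score = selected | {f_best}, gains[f_best]
        remaining.remove(f_best)
    return selected, best_score


# Usage with a hypothetical score: features "a" and "b" are informative,
# and every included feature carries a small cost.
informative = {"a", "b"}
def toy_score(subset):
    return len(subset & informative) - 0.1 * len(subset)

sel, sc = forward_selection(["a", "b", "c", "d"], toy_score)
print(sel, sc)
```

The loop evaluates the score once per candidate per round, which is exactly why such search strategies become computationally intensive when the candidate set is large.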

Several research groups have published comparisons aiming to identify the feature selection method that is universally the best. Unfortunately, such comparisons are too often carried out in a way that is just plain wrong. Based on the results of such studies, the computationally intensive search algorithms seem to perform much better than the simple approaches. However, this thesis shows that when the comparison is done properly, the simple and fast algorithms very often give results that are just as good, if not better.
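
The optimistic bias at issue can be illustrated with a small simulation (an illustrative sketch, not an experiment from the thesis). Suppose twenty hypothetical selection methods all have the same true accuracy, each is scored on the same modest validation set, and the best validation score is the one reported. That reported figure systematically exceeds what the "winning" method achieves on fresh data:

```python
import random

random.seed(0)

TRUE_ACC = 0.70   # every hypothetical method is equally good in truth
N_METHODS = 20    # number of candidate methods being compared
N_VAL = 100       # validation samples used to score each method
N_TEST = 10_000   # fresh samples used to re-test the chosen winner
TRIALS = 200      # repetitions to average out the randomness

def measured_accuracy(true_acc, n):
    """Accuracy observed on n samples when the true accuracy is true_acc."""
    return sum(random.random() < true_acc for _ in range(n)) / n

reported, honest = [], []
for _ in range(TRIALS):
    scores = [measured_accuracy(TRUE_ACC, N_VAL) for _ in range(N_METHODS)]
    reported.append(max(scores))                     # same data picked and scored the winner
    honest.append(measured_accuracy(TRUE_ACC, N_TEST))  # independent data for the winner

reported_mean = sum(reported) / TRIALS
honest_mean = sum(honest) / TRIALS
print(f"mean reported accuracy of the 'best' method: {reported_mean:.3f}")
print(f"mean accuracy measured on fresh data:        {honest_mean:.3f}")
```

Because the maximum of many noisy estimates is taken, the reported accuracy overshoots the true 0.70 by several percentage points, even though no method is actually better than any other; independent test data removes the bias.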

In addition, many studies suggest that excluding some of the features is much more useful than it actually is. This observation matters in practice, because the selection process typically consumes a lot of time and computing resources; it would therefore be very convenient not to have to carry it out at all. This thesis shows that when the benefits are measured correctly, they may be negligible compared to what has been reported previously.

Moreover, the thesis presents a better-performing approach for accuracy estimation when the amount of data is small. Further, extensions are discussed from feature selection to generic model selection, from the selection of individual features to that of the sensors measuring them, and from classification to regression. Finally, an industrial application is pointed out where the methods discussed prove useful.

This thesis consists of an overview and of the following five publications:

  1. Juha Reunanen (2003). Overfitting in Making Comparisons Between Variable Selection Methods. Journal of Machine Learning Research, volume 3, pp. 1371-1382.
  2. Juha Reunanen (2004). A Pitfall in Determining the Optimal Feature Subset Size. In A. Fred (Ed.), Proceedings of the Fourth International Workshop on Pattern Recognition in Information Systems (PRIS 2004), Porto, Portugal, April 13-14, pp. 176-185.
  3. Juha Reunanen (2006). Less Biased Measurement of Feature Selection Benefits. In C. Saunders, M. Grobelnik, S. Gunn, J. Shawe-Taylor (Eds.), Subspace, Latent Structure and Feature Selection: Statistical and Optimization Perspectives Workshop (SLSFS 2005), Revised Selected Papers (LNCS 3940), pp. 198-208.
  4. Juha Reunanen (2007). Model Selection and Assessment Using Cross-indexing. In J. Si, R. Sun, D. Brown, I. King, N. Kasabov (Eds.), Proceedings of the Twentieth International Joint Conference on Neural Networks (IJCNN 2007), Orlando, Florida, USA, August 12-17, pp. 2581-2585.
  5. Juha Reunanen, Mika Mononen, Marko Vauhkonen, Anssi Lehikoinen, and Jari P. Kaipio (2011). Machine Learning Approach for Locating Phase Interfaces Using Conductivity Probes. Inverse Problems in Science and Engineering, volume 19, issue 6, pp. 879-902.

Keywords: machine learning, feature selection, variable selection, overfitting, search algorithms, comparison of algorithms, model selection, classification, regression

This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.

© 2012 Aalto University


Last update 2012-10-31