
Input Variable Selection Methods for Construction of Interpretable Regression Models

Jarkko Tikka

Dissertation for the degree of Doctor of Science in Technology to be presented with due permission of the Faculty of Information and Natural Sciences for public examination and debate in Auditorium T1 at Helsinki University of Technology (Espoo, Finland) on the 12th of December, 2008, at 12 noon.

Overview in PDF format (ISBN 978-951-22-9664-4)   [1039 KB]
Dissertation is also available in print (ISBN 978-951-22-9663-7)

Abstract

Large data sets are collected and analyzed in a variety of research problems. Modern computers make it possible to measure ever-increasing numbers of samples and variables. Automated methods are required for the analysis, since traditional manual approaches are impractical for the growing amount of data. This thesis presents a number of computational methods, based on observed data and subject to modelling assumptions, for producing useful knowledge about the data-generating system.

Input variable selection methods for both linear and nonlinear function approximation problems are proposed. Variable selection has gained increasing attention in many applications, because it assists in the interpretation of the underlying phenomenon. The selected variables highlight the most relevant characteristics of the problem. In addition, the rejection of irrelevant inputs may reduce the training time and improve the prediction accuracy of the model.

Linear models play an important role in data analysis, since they are computationally efficient and form the basis for many more complicated models. This work pays particular attention to the simultaneous estimation of several response variables using linear combinations of the same subset of inputs. Input selection methods that were originally designed for a single response variable are extended to the case of multiple responses. The assumption of linearity is not, however, adequate in all problems. Hence, artificial neural networks are applied to model unknown nonlinear dependencies between the inputs and the response.
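The multiresponse linear setting described above can be written compactly. The following is only a sketch in assumed notation (the symbols n, m, q, X, Y, W and E are introduced here for illustration and are not taken from the thesis): n samples, m candidate inputs and q responses, with all responses sharing the same selected inputs.

```latex
% Requires amsmath and amssymb. Notation is assumed, not the thesis's own.
\[
  Y = XW + E, \qquad
  X \in \mathbb{R}^{n \times m},\quad
  W \in \mathbb{R}^{m \times q},\quad
  Y, E \in \mathbb{R}^{n \times q}
\]
% Selecting a common subset of inputs means that whole rows of W vanish:
\[
  \text{input } j \text{ is unused by every response}
  \iff W_{j,:} = 0 .
\]
```

Under this reading, the number of nonzero rows of W is the number of selected inputs.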

The first set of methods includes efficient stepwise selection strategies that assess the usefulness of the inputs in the model. Alternatively, input selection is formulated as an optimization problem: an objective function is minimized subject to sparsity constraints that encourage the selection of only a few inputs. The trade-off between the prediction accuracy and the number of input variables is adjusted by continuous-valued sparsity parameters.
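To make the idea of stepwise selection concrete, here is a minimal sketch of generic greedy forward selection for a single-response linear model. It is a textbook procedure shown for illustration only, not the algorithms proposed in the thesis. The function names (`solve`, `sse`, `forward_selection`) and the synthetic data are assumptions, and the sketch assumes centered data (no intercept) and linearly independent inputs.

```python
import random

def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # pivot row
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def sse(X, y, cols):
    """Residual sum of squares of the least-squares fit of y on columns `cols` of X."""
    if not cols:
        return sum(t * t for t in y)
    # Normal equations restricted to the chosen columns: G w = g
    G = [[sum(row[a] * row[b] for row in X) for b in cols] for a in cols]
    g = [sum(row[a] * t for row, t in zip(X, y)) for a in cols]
    w = solve(G, g)
    return sum((t - sum(wj * row[j] for wj, j in zip(w, cols))) ** 2
               for row, t in zip(X, y))

def forward_selection(X, y, k):
    """Greedily add, k times, the input that most reduces the residual sum of squares."""
    selected = []
    remaining = list(range(len(X[0])))
    for _ in range(k):
        best = min(remaining, key=lambda j: sse(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic example (assumed data): only inputs 0 and 3 influence the response.
random.seed(0)
X = [[random.gauss(0, 1) for _ in range(6)] for _ in range(100)]
y = [2 * r[0] - 3 * r[3] + random.gauss(0, 0.1) for r in X]
chosen = forward_selection(X, y, 2)
print(sorted(chosen))
```

Here the number of steps `k` plays the role of a discrete complexity control. The optimization-based formulations mentioned above instead tune continuous-valued sparsity parameters.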

Results from extensive experiments on both simulated functions and real benchmark data sets are reported. In comparisons with existing variable selection strategies, the proposed methods typically improve the results by reducing the prediction error, by decreasing the number of selected inputs, or both. The constructed sparse models are also found to produce more accurate predictions than the models that include all the input variables.

This thesis consists of an overview and of the following 7 publications:

  1. Timo Similä and Jarkko Tikka. 2005. Multiresponse sparse regression with application to multidimensional scaling. In: Włodzisław Duch, Janusz Kacprzyk, Erkki Oja, and Sławomir Zadrożny (editors). Proceedings of the 15th International Conference on Artificial Neural Networks: Formal Models and Their Applications (ICANN 2005). Part II. Warsaw, Poland. 11-15 September 2005. Springer-Verlag. Lecture Notes in Computer Science, volume 3697, pages 97-102. © 2005 by authors and © 2005 Springer Science+Business Media. By permission.
  2. Timo Similä and Jarkko Tikka. 2006. Common subset selection of inputs in multiresponse regression. In: Proceedings of the 2006 International Joint Conference on Neural Networks (IJCNN 2006). Vancouver, BC, Canada. 16-21 July 2006, pages 1908-1915. © 2006 IEEE. By permission.
  3. Timo Similä and Jarkko Tikka. 2007. Input selection and shrinkage in multiresponse linear regression. Computational Statistics & Data Analysis, volume 52, number 1, pages 406-422. © 2007 Elsevier Science. By permission.
  4. Jarkko Tikka and Jaakko Hollmén. 2008. Sequential input selection algorithm for long-term prediction of time series. Neurocomputing, volume 71, numbers 13-15, pages 2604-2615. © 2008 Elsevier Science. By permission.
  5. Jarkko Tikka and Jaakko Hollmén. 2008. Selection of important input variables for RBF network using partial derivatives. In: Michel Verleysen (editor). Proceedings of the 16th European Symposium on Artificial Neural Networks - Advances in Computational Intelligence and Learning (ESANN 2008). Bruges, Belgium. 23-25 April 2008. d-side publications, pages 167-172.
  6. Jarkko Tikka. 2007. Input selection for radial basis function networks by constrained optimization. In: Joaquim Marques de Sá, Luís A. Alexandre, Włodzisław Duch, and Danilo Mandic (editors). Proceedings of the 17th International Conference on Artificial Neural Networks (ICANN 2007). Part I. Porto, Portugal. 9-13 September 2007. Springer-Verlag. Lecture Notes in Computer Science, volume 4668, pages 239-248. © 2007 by author and © 2007 Springer Science+Business Media. By permission.
  7. Jarkko Tikka. 2008. Simultaneous input variable and basis function selection for RBF networks. Neurocomputing, accepted for publication. © 2008 by author and © 2008 Elsevier Science. By permission.

Keywords: data analysis, machine learning, function approximation, multiresponse linear regression, nonlinear regression, artificial neural networks, input variable selection, model selection, constrained optimization

This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.

© 2008 Helsinki University of Technology


Last update 2011-05-26