
Advances in Variable Selection and Visualization Methods for Analysis of Multivariate Data

Timo Similä

Dissertation for the degree of Doctor of Science in Technology to be presented with due permission of the Department of Computer Science and Engineering for public examination and debate in Auditorium T2 at Helsinki University of Technology (Espoo, Finland) on the 19th of October, 2007, at 12 o'clock noon.

Overview in PDF format (ISBN 978-951-22-8930-1) [2727 KB]
Dissertation is also available in print (ISBN 978-951-22-8929-5)


This thesis concerns the analysis of multivariate data. The amount of data that is obtained from various sources and stored in digital media is growing at an exponential rate. Data sets tend to be too large, in terms of both the number of variables and the number of observations, to be analyzed by hand. To make the analysis feasible, the data must be summarized. This work introduces machine learning methods that are capable of finding interesting patterns in the data automatically. The findings can be further used in decision making and prediction. The results of this thesis can be divided into three groups.

The first group of results is related to the problem of selecting a subset of input variables in order to build an accurate predictive model for several response variables simultaneously. Variable selection is, in essence, a difficult combinatorial problem, but the relaxations examined in this work transform it into a more tractable optimization problem over continuous-valued parameters. The main contribution here is extending several methods that were originally designed for a single response variable so that they are applicable to multiple response variables as well. Examples of such methods include the well-known lasso estimate and the least angle regression algorithm.
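
As a rough illustration of how a continuous relaxation can select a common subset of inputs for several responses, the following Python sketch minimizes a squared-error loss plus a penalty on the Euclidean norms of the rows of the coefficient matrix, using a generic proximal gradient iteration. It is an assumption-laden sketch of a related formulation, not the algorithms of the thesis: the function name `row_sparse_regression`, the penalty weight `lam`, and the synthetic data are all hypothetical choices made for the example.

```python
import numpy as np

def row_sparse_regression(X, Y, lam, n_iter=500):
    """Proximal gradient for the (assumed) relaxed problem
    min_W (1/2n) * ||Y - X W||_F^2 + lam * sum_j ||W[j, :]||_2.
    A whole row of W is zero when input j is dropped for all responses."""
    n, p = X.shape
    W = np.zeros((p, Y.shape[1]))
    # Step size 1/L, where L = sigma_max(X)^2 / n bounds the gradient's
    # Lipschitz constant.
    step = n / np.linalg.norm(X, 2) ** 2
    for _ in range(n_iter):
        grad = X.T @ (X @ W - Y) / n
        V = W - step * grad
        # Block soft-thresholding: shrink each row norm, zeroing small rows.
        norms = np.linalg.norm(V, axis=1, keepdims=True)
        scale = np.maximum(0.0, 1.0 - step * lam / np.maximum(norms, 1e-12))
        W = scale * V
    return W

# Synthetic data (hypothetical): 6 inputs, 3 responses, and only
# inputs 0 and 2 carry signal.
rng = np.random.default_rng(0)
n, p = 200, 6
X = rng.standard_normal((n, p))
W_true = np.zeros((p, 3))
W_true[0] = [2.0, -1.5, 1.0]
W_true[2] = [-2.0, 1.0, 1.5]
Y = X @ W_true + 0.1 * rng.standard_normal((n, 3))

W_hat = row_sparse_regression(X, Y, lam=0.3)
selected = [j for j in range(p) if np.linalg.norm(W_hat[j]) > 0]
print("selected inputs:", selected)
```

Because the penalty acts on entire rows, an input is either kept for all responses or removed for all of them, which is the sense in which the selection is simultaneous. Larger values of `lam` remove more inputs.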

The second group of results concerns unsupervised variable selection, where all variables are treated equally, without distinguishing between responses and inputs. The task is to detect the variables that contain, in some sense, as much information as possible. A related problem that is also examined is combining the two major categories of dimensionality reduction: variable selection and subspace projection. Simple modifications of the multiresponse regression techniques developed in this thesis offer a fresh approach to these unsupervised learning tasks. This is another contribution of the thesis.

The third group of results concerns extensions and applications of the self-organizing map (SOM). The SOM is a prominent tool in the initial exploratory phase of multivariate analysis. It provides a clustering and a visual low-dimensional representation of a set of high-dimensional observations. First, the thesis proposes an extension of the SOM algorithm that is applicable to strongly curvilinear but intrinsically low-dimensional data structures. Second, an application of the SOM is proposed for interpreting nonlinear quantile regression models. Third, a SOM-based method is introduced for analyzing the dependency of one multivariate data set on another.
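
For readers unfamiliar with the SOM, the following Python sketch shows a basic online SOM with a one-dimensional map. It illustrates only the standard algorithm, not the extensions proposed in the thesis; the function name `train_som`, the learning-rate and neighborhood schedules, and the synthetic data are assumptions made for the example.

```python
import numpy as np

def train_som(data, n_units=10, n_epochs=20, seed=0):
    """Basic online SOM on a 1-D grid of n_units prototype vectors."""
    rng = np.random.default_rng(seed)
    # Initialise prototypes from randomly chosen observations.
    codebook = data[rng.choice(len(data), n_units, replace=False)].copy()
    grid = np.arange(n_units)
    n_steps = n_epochs * len(data)
    t = 0
    for _ in range(n_epochs):
        for i in rng.permutation(len(data)):
            frac = t / n_steps
            alpha = 0.5 * (1.0 - frac)                      # learning rate
            sigma = max(n_units / 2 * (1.0 - frac), 0.5)    # neighborhood width
            x = data[i]
            # Best-matching unit: the prototype closest to the observation.
            bmu = np.argmin(np.sum((codebook - x) ** 2, axis=1))
            # Units near the BMU on the grid are pulled towards x as well.
            h = np.exp(-((grid - bmu) ** 2) / (2 * sigma ** 2))
            codebook += alpha * h[:, None] * (x - codebook)
            t += 1
    return codebook

# Synthetic data (hypothetical): a noisy curve embedded in three dimensions.
rng = np.random.default_rng(1)
s = rng.uniform(0.0, np.pi, 300)
data = np.column_stack([np.cos(s), np.sin(s), 0.5 * s])
data += 0.05 * rng.standard_normal(data.shape)

codebook = train_som(data)
# Clustering: each observation is assigned to its best-matching unit; the
# unit's grid index serves as a one-dimensional coordinate for display.
dists = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
labels = dists.argmin(axis=1)
```

The prototype vectors in `codebook` summarize the data, while the grid indices in `labels` give the low-dimensional representation that is typically visualized.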

This thesis consists of an overview and the following seven publications:

  1. Timo Similä and Sampsa Laine (2005). Visual approach to supervised variable selection by self-organizing map, International Journal of Neural Systems 15 (1-2): 101-110.
  2. Timo Similä (2005). Self-organizing map learning nonlinearly embedded manifolds, Information Visualization 4 (1): 22-31. © 2005 Palgrave Macmillan. By permission.
  3. Timo Similä and Jarkko Tikka (2005). Multiresponse sparse regression with application to multidimensional scaling, in W. Duch, J. Kacprzyk, E. Oja and S. Zadrozny (eds), Proceedings of the 15th International Conference on Artificial Neural Networks: Formal Models and Their Applications - ICANN 2005, Part II, Springer, Lecture Notes in Computer Science, Vol. 3697, pp. 97-102. © 2005 Springer Science+Business Media. By permission.
  4. Timo Similä (2006). Self-organizing map visualizing conditional quantile functions with multidimensional covariates, Computational Statistics & Data Analysis 50 (8): 2097-2110.
  5. Timo Similä and Jarkko Tikka (2006). Common subset selection of inputs in multiresponse regression, Proceedings of the 2006 IEEE International Joint Conference on Neural Networks - IJCNN 2006, pp. 1908-1915. © 2006 IEEE. By permission.
  6. Timo Similä (2007). Majorize-minimize algorithm for multiresponse sparse regression, Proceedings of the 32nd IEEE International Conference on Acoustics, Speech, and Signal Processing - ICASSP 2007, Vol. II, pp. 553-556. © 2007 IEEE. By permission.
  7. Timo Similä and Jarkko Tikka (2007). Input selection and shrinkage in multiresponse linear regression, Computational Statistics & Data Analysis 52 (1): 406-422. © 2007 Elsevier Science. By permission.

Keywords: machine learning, dimensionality reduction, regression, information visualization, variable selection

This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.

© 2007 Helsinki University of Technology

Last update 2011-05-26