The doctoral dissertations of the former Helsinki University of Technology (TKK) and Aalto University Schools of Technology (CHEM, ELEC, ENG, SCI) published in electronic format are available in the electronic publications archive of Aalto University - Aaltodoc.

Data Exploration with Self-Organizing Maps in Environmental Informatics and Bioinformatics

Mikko T. Kolehmainen

Dissertation for the degree of Doctor of Science in Technology to be presented with due permission of the Department of Computer Science and Engineering for public examination and debate in Auditorium AS1 (TUAS, Otaniementie 17) at Helsinki University of Technology (Espoo, Finland) on the 27th of February, 2004, at 12 o'clock noon.

Overview in PDF format (ISBN 951-27-0000-X) [1091 KB]
Dissertation is also available in print (ISBN 951-781-305-8)


The aim of this thesis was to evaluate the usability of self-organizing maps and some other methods of computational intelligence for analysing and modelling problems in environmental informatics and bioinformatics. The concepts of environmental informatics, bioinformatics, computational intelligence and data mining are first defined. An introduction follows to the data processing chain of knowledge discovery and to the methods used in this thesis, namely linear regression, self-organizing maps (SOM), Sammon's mapping, U-matrix representation, fuzzy logic, c-means and fuzzy c-means clustering, multi-layer perceptron (MLP), and regularization and Bayesian techniques. The challenges posed by environmental processes and bioprocesses are then identified, including missing data, complex lagged dependencies among variables, non-linear chaotic dynamics, ill-defined inverse problems, and large search spaces in optimization tasks.

The works included in this thesis are then evaluated and discussed. The results show that the combination of SOM and Sammon's mapping has great potential in data exploration. It can be used to reveal important features of the measurement techniques (e.g. separability of compounds), reveal new information about already studied phenomena, speed up research work, act as a hypothesis generator for traditional research, and supply a clear and intuitive visualization of the environmental phenomenon studied. The results of the regression studies show, as expected, that the MLP network predicts future airborne NO2 concentrations better than SOM-based regression or the least squares approach using periodic components. Additionally, local MLP models are shown to be slightly better than a single MLP model at estimating future values during episodes. In general, however, it can be concluded that the architectural choices tested cannot by themselves solve the model performance problems.

Finally, recommendations for future work are laid out. Firstly, the data exploration solution should be enhanced with methods from signal processing to enable the handling of measurements with different time scales and of lagged multivariate time series. The main suggestion, however, is to create an integrated environment for testing different hybrid schemes of computational intelligence for better time-series forecasting in environmental informatics and bioinformatics.

This thesis consists of an overview and the following 7 publications:

  1. Kolehmainen M., Martikainen H., Hiltunen T. and Ruuskanen J., 2000. Forecasting air quality parameters using hybrid neural network modelling. Environmental Monitoring and Assessment 65, number 1-2, pages 277-286.
  2. Kolehmainen M., Martikainen H. and Ruuskanen J., 2001. Neural networks and periodic components used in air quality forecasting. Atmospheric Environment 35, number 5, pages 815-825.
  3. Kolehmainen M., Rissanen E., Raatikainen O. and Ruuskanen J., 2001. Monitoring odorous sulfur emissions using self-organizing maps for handling ion mobility spectrometry data. Journal of Air and Waste Management 51, pages 966-971.
  4. Kolehmainen M., Rönkkö P. and Raatikainen O., 2003. Monitoring of yeast fermentation by ion mobility spectrometry measurement and data visualisation with Self-Organizing Maps. Analytica Chimica Acta 484, number 1, pages 93-100.
  5. Niska H., Hiltunen T., Kolehmainen M. and Ruuskanen J., 2003. Hybrid models for forecasting air pollution episodes. International Conference on Artificial Neural Networks and Genetic Algorithms (ICANNGA'03). University Technical Institute of Roanne, France, 23-25 April 2003. Wien, Springer-Verlag, pages 80-84.
  6. Törönen P., Kolehmainen M., Wong G. and Castrén E., 1999. Analysis of gene expression data using self-organizing maps. Federation of European Biochemical Societies (FEBS) Letters 451, number 2, pages 142-146.
  7. Valkonen V.-P., Kolehmainen M., Lakka H.-M. and Salonen J., 2002. Insulin resistance syndrome revisited: application of self-organizing maps. International Journal of Epidemiology 31, number 4, pages 864-871.

Errata of publications 1, 2 and 5

Keywords: environmental science computing, biology computing, data analysis, data mining, knowledge acquisition, self-organising feature maps, neural nets

This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.

© 2004 Helsinki University of Technology

Last update 2011-05-26