The doctoral dissertations of the former Helsinki University of Technology (TKK) and Aalto University Schools of Technology (CHEM, ELEC, ENG, SCI) published in electronic format are available in the electronic publications archive of Aalto University - Aaltodoc.

Exploratory Cluster Analysis of Genomic High-Throughput Data Sets and Their Dependencies

Janne Nikkilä

Dissertation for the degree of Doctor of Science in Technology to be presented with due permission of the Department of Computer Science and Engineering for public examination and debate in Auditorium T2 at Helsinki University of Technology (Espoo, Finland) on the 1st of December, 2005, at 12 o'clock noon.

Overview in PDF format (ISBN 951-22-7909-6) [1492 KB]
Dissertation is also available in print (ISBN 951-22-7908-8)

Abstract

This thesis studies exploratory cluster analysis of genomic high-throughput data sets and their interdependencies. In modern biology, new high-throughput measurements generate numerical data simultaneously from thousands of molecules in the cell. This enables a new perspective on biology, called systems biology. The discipline that develops methods for analyzing systems biology data is called bioinformatics. The work in this thesis contributes mainly to bioinformatics, but the approaches presented are general-purpose machine learning methods and can be applied in many problem areas.

A central problem in analyzing genomic high-throughput data is that the potentially useful new findings are hidden in a huge mass of data. They need to be extracted and presented to the analyst as visual overviews.

This thesis introduces new exploratory cluster analysis methods for extracting and visualizing findings of high-throughput data. Three kinds of methods are presented to solve progressively better-focused problems. First, visualizations and clusterings using the self-organizing map are applied to genomic data sets. Second, the recently developed methods for improving the visualization and clustering of a data set with auxiliary data are applied. Third, new methods for exploring the dependency between data sets are developed and applied. The new methods are based on maximizing the Bayes factor between the model of independence and the model of dependence for finite data.
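As a rough illustration of the Bayes factor mentioned above, the following minimal sketch compares a model of dependence with a model of independence for a contingency table of counts, such as the cross-tabulation of cluster memberships from two data sets. It is not the formulation used in the publications: the symmetric Dirichlet priors, the hyperparameter `alpha`, the direction of the ratio (dependence over independence), and the function names are all assumptions made for this example. The sketch only evaluates the Bayes factor for a fixed table; it does not perform the maximization over clusterings.

```python
from math import lgamma


def log_dirichlet_multinomial(counts, alpha):
    """Log marginal likelihood of an observed sequence with the given
    category counts, under a symmetric Dirichlet(alpha) prior (assumed)."""
    k = len(counts)
    n = sum(counts)
    return (lgamma(k * alpha) - lgamma(n + k * alpha)
            + sum(lgamma(c + alpha) - lgamma(alpha) for c in counts))


def log_bayes_factor(table, alpha=1.0):
    """Log Bayes factor of dependence versus independence for a
    contingency table given as a list of rows of counts.

    Dependence: one Dirichlet prior over all cells.
    Independence: separate Dirichlet priors over row and column margins.
    """
    row_margins = [sum(row) for row in table]
    col_margins = [sum(col) for col in zip(*table)]
    cells = [count for row in table for count in row]

    log_dependent = log_dirichlet_multinomial(cells, alpha)
    log_independent = (log_dirichlet_multinomial(row_margins, alpha)
                       + log_dirichlet_multinomial(col_margins, alpha))
    return log_dependent - log_independent


if __name__ == "__main__":
    # Counts concentrated on the diagonal favor the dependence model.
    print(log_bayes_factor([[50, 0], [0, 50]]))
    # Evenly spread counts favor the independence model.
    print(log_bayes_factor([[25, 25], [25, 25]]))
```

A positive value indicates that the table is better explained by the dependence model, and a negative value favors independence. Because the marginal likelihoods are computed for finite counts, the comparison does not rely on large-sample approximations.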

The methods outperformed their alternatives in numerical comparisons. In applications they confirmed known biological findings, which validates the methods, and also generated new hypotheses. The applications included exploration of yeast gene expression data, yeast gene expression data in a new metric learned with auxiliary data, the regulation of yeast gene expression by transcription factors, and the dependencies between human and mouse gene expression.

This thesis consists of an overview and of the following 9 publications:

Samuel Kaski, Janne Nikkilä, and Teuvo Kohonen. Methods for Exploratory Cluster Analysis. In: Szczepaniak, Segovia, Kacprzyk, Zadeh (Eds.): Intelligent Exploration of the Web, pages 136-151, Springer, Berlin, 2003.
Janne Nikkilä, Petri Törönen, Samuel Kaski, Jarkko Venna, Eero Castrén, and Garry Wong. Analysis and Visualization of Gene Expression Data using Self-Organizing Maps. Neural Networks, Special Issue on New Developments on Self-Organizing Maps, vol. 15, issue 8-9, pages 953-966, 2002.
Samuel Kaski, Janne Sinkkonen, and Janne Nikkilä. Clustering Gene Expression Data by Mutual Information with Gene Function. In: Dorffner, Bischof, Hornik (Eds.): Proceedings of the International Conference on Artificial Neural Networks (ICANN 2001), pages 81-86, Springer-Verlag, Berlin, Germany, 2001.
Merja Oja, Janne Nikkilä, Petri Törönen, Garry Wong, Eero Castrén, and Samuel Kaski. Exploratory Clustering of Gene Expression Profiles of Mutated Yeast Strains. In: Zhang and Shmulevich (Eds.): Computational And Statistical Approaches To Genomics, pages 65-78, Kluwer Academic Publishers, 2002.
Janne Sinkkonen, Samuel Kaski, and Janne Nikkilä. Discriminative Clustering: Optimal Contingency Tables by Learning Metrics. In: Elomaa, Mannila, Toivonen (Eds.): Proceedings of the 13th European Conference on Machine Learning (ECML 2002), Lecture Notes in Artificial Intelligence 2430, pages 418-430, Springer, Berlin, 2002.
Samuel Kaski, Janne Nikkilä, Merja Oja, Jarkko Venna, Petri Törönen, and Eero Castrén. Trustworthiness and Metrics in Visualizing Similarity of Gene Expression. BMC Bioinformatics, 4: 48, 2003. © 2003 by authors.
Samuel Kaski, Janne Nikkilä, Eerika Savia, and Christophe Roos. Discriminative Clustering of Yeast Stress Response. In: Seiffert, Jain, Schweizer (Eds.): Bioinformatics using Computational Intelligence Paradigms, pages 75-92, Springer, Berlin, 2005.
Samuel Kaski, Janne Nikkilä, Janne Sinkkonen, Leo Lahti, Juha Knuuttila, and Christophe Roos. Associative Clustering for Exploring Dependencies between Functional Genomics Data Sets. IEEE/ACM Transactions on Computational Biology and Bioinformatics, Special Issue on Machine Learning for Bioinformatics - Part 2, vol. 2, no. 3, pages 203-216, July-September 2005. © 2005 IEEE. By permission.
Janne Nikkilä, Christophe Roos, Eerika Savia, and Samuel Kaski. Exploratory Modeling of Yeast Stress Response and its Regulation with gCCA and Associative Clustering. International Journal of Neural Systems, Special Issue on Bioinformatics, vol. 15, no. 4, pages 237-246, 2005. © 2005 World Scientific Publishing Company. By permission.

Keywords: bioinformatics, clustering, data integration, dependency modeling, exploratory data analysis, gene expression, human, learning metrics, mouse, self-organizing map, systems biology, transcription, yeast

This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.

Last update 2011-05-26
