The doctoral dissertations of the former Helsinki University of Technology (TKK) and Aalto University Schools of Technology (CHEM, ELEC, ENG, SCI) published in electronic format are available in the electronic publications archive of Aalto University - Aaltodoc.

Graphical Models for Biclustering and Information Retrieval in Gene Expression Data

José Caldas

Doctoral dissertation for the degree of Doctor of Science in Technology to be presented with due permission of the School of Science for public examination and debate in Auditorium T2 at the Aalto University School of Science (Espoo, Finland) on the 20th of April 2012 at 12 noon.

Overview in PDF format (ISBN 978-952-60-4559-7) [634 KB]
Dissertation is also available in print (ISBN 978-952-60-4558-0)

Abstract

The cell coordinates its biological response to the environment partly via the selective synthesis of thousands of unique RNA and protein molecules. Understanding the molecular biology of the cell is thus essential to the advancement of areas such as health care, agriculture, and energy production, but requires the ability to simultaneously acquire information about thousands of molecules in a sample. Recent high-throughput measurement technologies address this need. Although useful, they generate a high volume of data and introduce methodological challenges, effectively shifting the bottleneck in molecular biology research from data acquisition to data analysis. In particular, an important challenge is the genome-wide analysis of gene expression, that is, of how RNA is transcribed across different conditions, organisms, and tissues.

Probabilistic frameworks are promising approaches for developing computational methods for biological data analysis, owing to their flexibility, soundness, and ability to handle noisy data. This thesis contributes probabilistic methods for two relevant tasks in genome-wide gene expression analysis, namely biclustering and information retrieval.

Biclustering concerns the simultaneous grouping of objects, e.g., genes, and conditions. The first contribution is the development of a Bayesian extension to an existing biclustering model. The second contribution is a novel probabilistic method that derives a hierarchical organization of the microarrays in a gene expression data set and at the same time indicates the genes that characterize the hierarchy. Finally, the third contribution is a general probabilistic biclustering framework that easily lends itself to different data types and model assumptions.
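
As a rough illustration of what a bicluster is, the sketch below builds a small toy matrix in which each bicluster is a set of genes together with a set of conditions, and adds a constant effect to the cells they share. This is a simplified, plaid-style additive construction written only for illustration: the function name, the toy sizes, and the effect values are assumptions, and it is not the Bayesian model developed in the thesis (which infers such structure from data rather than generating it).

```python
# Toy plaid-style construction (illustrative assumption, not the thesis's method):
# a matrix is a background level plus additive layers, and each layer is
# active only on one subset of genes and one subset of conditions.

def plaid_matrix(n_genes, n_conditions, background, layers):
    """Build a genes x conditions matrix from a background and bicluster layers.

    Each layer is a tuple (gene_indices, condition_indices, effect). The effect
    is added to every cell whose gene AND condition both belong to the layer,
    so overlapping layers simply sum.
    """
    matrix = [[background] * n_conditions for _ in range(n_genes)]
    for genes, conditions, effect in layers:
        for i in genes:
            for j in conditions:
                matrix[i][j] += effect
    return matrix


# Two hypothetical, overlapping biclusters on a 4-gene x 5-condition matrix.
layers = [
    ({0, 1}, {0, 1, 2}, 2.0),   # genes 0-1 are raised under conditions 0-2
    ({1, 2}, {2, 3}, -1.0),     # genes 1-2 are lowered under conditions 2-3
]
expression = plaid_matrix(4, 5, 0.5, layers)

for row in expression:
    print(row)
```

Gene 1 under condition 2 belongs to both biclusters, so both effects apply to that cell, while gene 3 belongs to neither and stays at the background level. A biclustering method works in the opposite direction: given only the matrix, it tries to recover the gene and condition subsets.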

Information retrieval in gene expression data is needed because of the increasing amount of available data stored in public databases. Two probabilistic methods for information retrieval are proposed. The models are used in a series of biological case studies that show how the proposed approaches have the potential to accelerate biological research by jointly analyzing data from different studies. In particular, several connections between biological conditions found by the models either correspond to existing biological knowledge or were used in a confirmatory follow-up study to obtain novel biological findings.

This thesis consists of an overview and the following five publications:

  1. José Caldas and Samuel Kaski. Bayesian biclustering with the plaid model. In Proceedings of the 2008 IEEE International Workshop on Machine Learning for Signal Processing XVIII, José Príncipe, Deniz Erdogmus, and Tulay Adali (editors), pages 291-296, IEEE, Piscataway, N.J., October 2008.
  2. José Caldas, Nils Gehlenborg, Ali Faisal, Alvis Brazma, and Samuel Kaski. Probabilistic retrieval and visualization of biologically relevant microarray experiments. Bioinformatics, 25(12):i145-i153 (ISMB/ECCB 2009 Conference Proceedings), June 2009.
  3. José Caldas and Samuel Kaski. Hierarchical generative biclustering for microRNA expression analysis. Journal of Computational Biology, 18(3):251-261 (RECOMB 2010 Special Issue), March 2011.
  4. José Caldas and Samuel Kaski. A mixture-of-experts approach to biclustering. Submitted to a journal, 10 pages, 2011.
  5. José Caldas, Nils Gehlenborg, Eeva Kettunen, Ali Faisal, Mikko Rönty, Andrew G. Nicholson, Sakari Knuutila, Alvis Brazma, and Samuel Kaski. Data-driven information retrieval in heterogeneous collections of transcriptomics data links SIM2s to malignant pleural mesothelioma. Bioinformatics, 28(2):246-253, January 2012.

Keywords: probabilistic modelling, Bayesian network, biclustering, information retrieval, transcriptomics

This publication is copyrighted. You may download, display, and print it for your own personal use. Commercial use is prohibited.

© 2012 Aalto University


Last update 2012-10-31