The doctoral dissertations of the former Helsinki University of Technology (TKK) and Aalto University Schools of Technology (CHEM, ELEC, ENG, SCI) published in electronic format are available in the electronic publications archive of Aalto University - Aaltodoc.

Methods for Exploring Genomic Data Sets: Application to Human Endogenous Retroviruses

Merja Oja

Dissertation for the degree of Doctor of Science in Technology to be presented with due permission of the Department of Computer Science and Engineering for public examination and debate in Auditorium T2 at Helsinki University of Technology (Espoo, Finland) on the 14th of December, 2007, at 12 o'clock noon.

Overview in PDF format (ISBN 978-951-22-9062-8) [1975 KB]
Dissertation is also available in print (ISBN 978-951-22-9061-1)

Abstract

In this thesis, exploratory data analysis methods are developed for analyzing genomic data, in particular human endogenous retrovirus (HERV) sequences and gene expression data. HERVs are remnants of ancient retroviral infections that now reside within the human genome. Little is known about their functions, although they have been implicated in some diseases. This thesis provides methods for analyzing the properties and expression patterns of HERVs.

Genomic data sets are now so large that sophisticated data analysis methods are needed to uncover interesting structures in them. The purpose of exploratory methods is to help generate hypotheses about the properties of the data. For example, grouping together genes that behave similarly, and hence presumably have similar functions, can suggest a function for previously uncharacterized genes. The hypotheses generated by exploratory data analysis can be verified later in more detailed studies, whereas a detailed analysis of all the genes of an organism would be too time-consuming and expensive.

In this thesis, self-organizing map (SOM) based exploratory data analysis approaches are presented for visualizing and grouping gene expression profiles and HERV sequences. The SOM-based analysis is complemented with estimates of the reliability of the SOM visualization display. New measures are developed for estimating the relative reliability of different parts of the visualization. Furthermore, methods are introduced for assessing the reliability of groups of samples manually extracted from a visualization display.
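
As a rough illustration of the SOM idea mentioned above, the following minimal sketch trains a small rectangular map on numeric vectors with the standard online update rule. It is not the implementation used in the thesis: the function names (`train_som`, `best_matching_unit`), the Euclidean distance, the linear decay schedules and all parameter values are assumptions chosen for brevity. The median SOM used for HERV sequences in the listed publications differs in that it works on pairwise sequence similarities rather than vectors.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid coordinate of the unit whose model vector is closest to x."""
    return min(weights, key=lambda u: sum((a - b) ** 2 for a, b in zip(weights[u], x)))


def train_som(data, rows, cols, epochs=50, lr0=0.5, sigma0=None, seed=0):
    """Train a rectangular SOM with online updates (illustrative sketch only)."""
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = sigma0 or max(rows, cols) / 2.0
    # Initialise each unit's model vector from a randomly chosen sample.
    weights = {(r, c): list(rng.choice(data)) for r in range(rows) for c in range(cols)}
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)
        for x in order:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                  # assumed linear learning-rate decay
            sigma = max(sigma0 * (1.0 - frac), 0.5)  # assumed neighbourhood shrinkage
            bmu = best_matching_unit(weights, x)
            for unit, w in weights.items():
                grid_d2 = (unit[0] - bmu[0]) ** 2 + (unit[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma * sigma))  # Gaussian neighbourhood
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights


# Toy "expression profiles": two well-separated groups of samples.
group_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
group_b = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
som = train_som(group_a + group_b, rows=1, cols=4)
print(best_matching_unit(som, (0.0, 0.0)), best_matching_unit(som, (11.0, 11.0)))
```

Samples that behave similarly end up on the same or neighbouring map units, which is what makes the map usable as a grouping and visualization display.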

Finally, a new computational method is developed for a specific problem in HERV biology. Activities of individual HERV sequences are estimated from a database of expressed sequence tags using a hidden Markov mixture model. The model is used to analyze the activity patterns of HERVs.
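
The activity estimation described above can be pictured with a much simplified sketch. It assumes that the likelihood of each expressed sequence tag (EST) under each HERV's sequence model has already been computed (for example by a hidden Markov model, which is not shown here), and it estimates only the mixing proportions of the HERVs with the EM algorithm. The function name `estimate_activities`, the input layout and the toy numbers are hypothetical and are not taken from the thesis; the actual model is described in publication 6.

```python
def estimate_activities(likelihoods, n_iter=200, tol=1e-9):
    """Estimate mixture proportions by EM.

    likelihoods[n][k] is an assumed, precomputed likelihood of EST n under the
    model of HERV k. The returned proportions are read here as relative activities.
    """
    n_herv = len(likelihoods[0])
    pi = [1.0 / n_herv] * n_herv  # start from uniform activities
    for _ in range(n_iter):
        counts = [0.0] * n_herv
        for row in likelihoods:
            weighted = [p * l for p, l in zip(pi, row)]
            total = sum(weighted)
            if total == 0.0:
                continue  # this EST matches no HERV; skip it
            # E-step: responsibility of each HERV for this EST.
            for k, w in enumerate(weighted):
                counts[k] += w / total
        norm = sum(counts)
        if norm == 0.0:
            return pi  # nothing to learn from
        # M-step: new proportions are the normalised expected counts.
        new_pi = [c / norm for c in counts]
        converged = max(abs(a - b) for a, b in zip(pi, new_pi)) < tol
        pi = new_pi
        if converged:
            break
    return pi


# Toy data: three ESTs match only HERV 0, one EST matches only HERV 1.
toy = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
activities = estimate_activities(toy)
print(activities)
```

When an EST matches several closely related HERVs, its likelihood row has several non-zero entries, and the E-step splits that EST between them in proportion to the current activity estimates.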

This thesis consists of an overview and the following 7 publications:

  1. Merja Oja, Janne Nikkilä, Petri Törönen, Garry Wong, Eero Castrén, and Samuel Kaski. Exploratory clustering of gene expression profiles of mutated yeast strains. In Wei Zhang and Ilya Shmulevich, editors, Computational and Statistical Approaches to Genomics, pages 65-78. Kluwer, Boston, MA, 2002.
  2. Samuel Kaski, Janne Nikkilä, Merja Oja, Jarkko Venna, Petri Törönen, and Eero Castrén. Trustworthiness and metrics in visualizing similarity of gene expression. BMC Bioinformatics, 4: 48, 2003. © 2003 by authors.
  3. Merja Oja, Panu Somervuo, Samuel Kaski, and Teuvo Kohonen. Clustering of human endogenous retrovirus sequences with median self-organizing map. In Proceedings of the 4th Workshop on Self-Organizing Maps (WSOM 2003), 11-14 September 2003, Hibikino, Japan, on CD-ROM. © 2003 WSOM'03 Organizing Committee. By permission.
  4. Merja Oja, Göran Sperber, Jonas Blomberg, and Samuel Kaski. Grouping and visualizing human endogenous retroviruses by bootstrapping median self-organizing maps. In Proceedings of the 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB 2004), 7-8 October 2004, San Diego, USA, pages 95-101. © 2004 IEEE. By permission.
  5. Merja Oja, Göran O. Sperber, Jonas Blomberg, and Samuel Kaski. Self-organizing map-based discovery and visualization of human endogenous retroviral sequence groups. International Journal of Neural Systems, 15 (3): 163-179, 2005. © 2005 by authors and © 2005 World Scientific Publishing Company. By permission.
  6. Merja Oja, Jaakko Peltonen, Jonas Blomberg, and Samuel Kaski. Methods for estimating human endogenous retrovirus activities from EST databases. BMC Bioinformatics, 8 (Suppl. 2): S11, 2007. © 2007 by authors.
  7. Merja Oja. In silico expression profiles of human endogenous retroviruses. In Proceedings of the Second IAPR International Workshop on Pattern Recognition in Bioinformatics (PRIB 2007), 1-2 October 2007, Singapore, Lecture Notes in Bioinformatics, volume 4774, pages 253-263, 2007. © 2007 by author and © 2007 Springer Science+Business Media. By permission.

Keywords: bioinformatics, exploratory data analysis, gene expression, hidden Markov model, human endogenous retrovirus, information visualization, learning metrics, reliability, self-organizing map

This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.

© 2007 Helsinki University of Technology


Last update 2011-05-26