The doctoral dissertations of the former Helsinki University of Technology (TKK) and Aalto University Schools of Technology (CHEM, ELEC, ENG, SCI) published in electronic format are available in the electronic publications archive of Aalto University - Aaltodoc.

Randomization Algorithms for Assessing the Significance of Data Mining Results

Markus Ojala

Doctoral dissertation for the degree of Doctor of Science in Technology to be presented with due permission of the School of Science for public examination and debate in Auditorium E at the Aalto University School of Science (Espoo, Finland) on the 12th of November 2011 at 12 noon.

Overview in PDF format (ISBN 978-952-60-4323-4) [1100 KB]
Dissertation is also available in print (ISBN 978-952-60-4322-7)

Abstract

Data mining is an interdisciplinary research area that develops general methods for finding interesting and useful knowledge in large collections of data. This thesis addresses, from a computational point of view, the problem of assessing whether the obtained data mining results are merely random artefacts in the data or something more interesting.

In randomization-based significance testing, a result is compared with the results obtained on randomized data. The randomized data are assumed to share some basic properties with the original data. To apply the randomization approach, the first step is to define these properties. The next step is to develop algorithms that can produce such randomizations. Results on the real data that clearly differ from the results on the randomized data are not directly explained by the studied properties of the data.
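The comparison described above can be sketched as an empirical p-value computation. This is a minimal illustration under stated assumptions, not the algorithms of the thesis: the function names (empirical_p_value, sum_of_products, shuffle_second_column), the choice of statistic, and the choice of preserved property (each column keeps its own values) are all hypothetical.

```python
import random

def empirical_p_value(data, statistic, randomize, n_samples=999, seed=0):
    """Fraction of randomized datasets whose statistic is at least as
    large as the one observed on the real data (with the usual +1
    correction so that the estimate is never exactly zero)."""
    rng = random.Random(seed)
    observed = statistic(data)
    at_least_as_extreme = sum(
        1 for _ in range(n_samples)
        if statistic(randomize(data, rng)) >= observed
    )
    return (at_least_as_extreme + 1) / (n_samples + 1)

# Hypothetical statistic: sum of products of paired values, which is
# large when the two columns increase together.
def sum_of_products(pairs):
    return sum(x * y for x, y in pairs)

# Hypothetical randomization: shuffle the second column. Each column
# keeps exactly the same values; only the pairing is destroyed.
def shuffle_second_column(pairs, rng):
    ys = [y for _, y in pairs]
    rng.shuffle(ys)
    return [(x, y) for (x, _), y in zip(pairs, ys)]

pairs = [(i, i) for i in range(20)]  # perfectly aligned toy data
p = empirical_p_value(pairs, sum_of_products, shuffle_second_column)
print(p)  # a small value: the pairing is not explained by the column values alone
```

A small p-value here says only that the result is not explained by the preserved property; a different choice of preserved properties can lead to a different conclusion.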

In this thesis, new randomization methods are developed for four specific data mining scenarios. First, randomizing matrix data while preserving the distributions of values in rows and columns is studied. Next, a general randomization approach is introduced for iterative data mining. Randomization in multi-relational databases is also considered. Finally, a simple permutation method is given for assessing whether dependencies between features are exploited in classification.
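As an illustration of the last scenario, one plausible way to break dependencies between features while keeping each feature's relation to the class label is to shuffle every feature column independently within each class. The abstract does not spell out the scheme, so the sketch below is an assumption, and the function name permute_within_class is hypothetical.

```python
import random

def permute_within_class(rows, labels, rng):
    """Return a copy of rows in which each feature column has been
    shuffled independently among the rows of the same class.

    Per-class value distributions of every feature are preserved, but
    dependencies between features inside a class are destroyed."""
    out = [list(r) for r in rows]
    n_features = len(rows[0]) if rows else 0

    # Group row indices by class label.
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)

    for idx in by_class.values():
        for j in range(n_features):
            vals = [rows[i][j] for i in idx]
            rng.shuffle(vals)
            for i, v in zip(idx, vals):
                out[i][j] = v
    return out

rows = [[1, 10], [2, 20], [3, 30], [4, 40], [5, 50], [6, 60]]
labels = ["a", "a", "a", "b", "b", "b"]
randomized = permute_within_class(rows, labels, random.Random(1))
```

If a classifier performs about as well on data randomized this way as on the original data, that would suggest it is not exploiting dependencies between features.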

The properties of the new randomization methods are analyzed theoretically. Extensive experiments are performed on real and artificial datasets. The randomization methods introduced in this thesis are useful in various data mining applications. The methods work well on different types of data, are easy to use, and provide meaningful information to further improve and understand the data mining results.

This thesis consists of an overview and the following five publications:

Markus Ojala, Niko Vuokko, Aleksi Kallio, Niina Haiminen, and Heikki Mannila. 2009. Randomization methods for assessing data analysis results on real-valued matrices. Statistical Analysis and Data Mining, volume 2, number 4, pages 209-230. © 2009 Wiley Periodicals. By permission.
Markus Ojala. 2010. Assessing data mining results on matrices with randomization. In: Geoffrey I. Webb, Bing Liu, Chengqi Zhang, Dimitrios Gunopulos, and Xindong Wu (editors). Proceedings of the 10th IEEE International Conference on Data Mining (ICDM 2010). Sydney, Australia. 14-17 December 2010. IEEE. Pages 959-964. ISBN 978-1-4244-9131-5. © 2010 Institute of Electrical and Electronics Engineers (IEEE). By permission.
Sami Hanhijärvi, Markus Ojala, Niko Vuokko, Kai Puolamäki, Nikolaj Tatti, and Heikki Mannila. 2009. Tell me something I don't know: Randomization strategies for iterative data mining. In: John Elder, Françoise Soulié Fogelman, Peter Flach, and Mohammed Zaki (editors). Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2009). Paris, France. 28 June - 1 July 2009. New York, NY, USA. ACM. Pages 379-388. ISBN 978-1-60558-495-9. © 2009 Association for Computing Machinery (ACM). By permission.
Markus Ojala, Gemma C. Garriga, Aristides Gionis, and Heikki Mannila. 2010. Evaluating query result significance in databases via randomizations. In: Proceedings of the 10th SIAM International Conference on Data Mining (SDM 2010). Columbus, Ohio, USA. 29 April - 1 May 2010. Society for Industrial and Applied Mathematics. Pages 906-917. © 2010 Society for Industrial and Applied Mathematics (SIAM). By permission.
Markus Ojala and Gemma C. Garriga. 2010. Permutation tests for studying classifier performance. Journal of Machine Learning Research, volume 11, pages 1833-1863. © 2010 by authors.

Keywords: data mining, randomization, significance testing, MCMC, matrix, relational database, clustering, classification, iterative analysis

This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.

Last update 2012-02-03
