The doctoral dissertations of the former Helsinki University of Technology (TKK) and Aalto University Schools of Technology (CHEM, ELEC, ENG, SCI) published in electronic format are available in the electronic publications archive of Aalto University - Aaltodoc.
Aalto

Advances in Mining Binary Data: Itemsets as Summaries

Nikolaj Tatti

Dissertation for the degree of Doctor of Science in Technology to be presented with due permission of the Faculty of Information and Natural Sciences for public examination and debate in Auditorium TU1 at Helsinki University of Technology (Espoo, Finland) on the 6th of June, 2008, at 12 noon.

Overview in PDF format (ISBN 978-951-22-9376-6)   [1898 KB]
Dissertation is also available in print (ISBN 978-951-22-9375-9)

Abstract

Mining frequent itemsets is one of the most popular topics in data mining. Itemsets are local patterns, representing frequently cooccurring sets of variables. This thesis studies the use of itemsets to give information about the whole dataset.

We show how to use itemsets for answering queries, that is, finding out the number of transactions satisfying some given formula. While this is a simple procedure given the original data, the task transforms into a computationally infeasible problem if we seek the solution using the itemsets. By making some assumptions of the structure of the itemsets and applying techniques from the theory of Markov Random Fields we are able to reduce the computational burden of query answering.

We can also use the known itemsets to predict the unknown itemsets. The difference between the prediction and the actual value can be used for ranking itemsets. In fact, this method can be seen as generalisation for ranking itemsets based on their deviation from the independence model, an approach commonly used in the data mining literature.

The next contribution is to use itemsets to define a distance between the datasets. We achieve this by computing the difference between the frequencies of the itemsets. We take into account the fact that the itemset frequencies may be correlated and by removing the correlation we show that our distance transforms into Euclidean distance between the frequencies of parity formulae.

The last contribution concerns calculating the effective dimension of binary data. We apply fractal dimension, a known concept that works well with real-valued data. Applying fractal dimension dimension directly is problematic because of the unique nature of binary data. We propose a solution to this problem by introducing a new concept called normalised correlation dimension. We study our approach theoretically and empirically by comparing it against other methods.

This thesis consists of an overview and of the following 5 publications:

  1. Nikolaj Tatti. 2007. Distances between data sets based on summary statistics. Journal of Machine Learning Research, volume 8, pages 131-154. © 2007 by author.
  2. Nikolaj Tatti. 2006. Computational complexity of queries based on itemsets. Information Processing Letters, volume 98, number 5, pages 183-187. © 2006 by author and © 2006 Elsevier Science. By permission.
  3. Nikolaj Tatti. 2006. Safe projections of binary data sets. Acta Informatica, volume 42, numbers 8-9, pages 617-638. © 2006 by author and © 2006 Springer Science+Business Media. By permission.
  4. Nikolaj Tatti. 2008. Maximum entropy based significance of itemsets. Knowledge and Information Systems (KAIS), accepted for publication. © 2008 Springer Science+Business Media. By permission.
  5. Nikolaj Tatti, Taneli Mielikäinen, Aristides Gionis, and Heikki Mannila. 2006. What is the dimension of your binary data? In: Proceedings of the Sixth IEEE International Conference on Data Mining (ICDM 2006). Hong Kong, 18-22 December 2006, pages 603-612. © 2006 IEEE. By permission.

Keywords: data mining, frequent itemset, boolean queries, safe projections, itemset ranking

This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.

© 2008 Helsinki University of Technology


Last update 2011-05-26