## The doctoral dissertations of the former Helsinki University of Technology (TKK) and Aalto University Schools of Technology (CHEM, ELEC, ENG, SCI) published in electronic format are available in the electronic publications archive of Aalto University - Aaltodoc. | |

Dissertation for the degree of Doctor of Science in Technology to be presented
with due permission of the Department of Computer Science and Engineering
for public examination and debate in Auditorium T2 at Helsinki University of
Technology (Espoo, Finland) on the 31^{st} of May, 2006, at 12 o'clock noon.

Dissertation in PDF format (ISBN 951-22-8202-X) [4541 KB]

Dissertation is also available in print (ISBN 951-22-8201-1)

Frequent itemsets are one of the best known concepts in data mining, and there is active research in itemset mining algorithms. An itemset is frequent in a database if its items co-occur in sufficiently many records. This thesis addresses two questions related to frequent itemsets. The first question is raised by a method for approximating logical queries by an inclusion-exclusion sum truncated to the terms corresponding to the frequent itemsets: how good are the approximations thereby obtained? The answer is twofold: in theory, the worst-case bound for the algorithm is very large, and a construction is given that shows the bound to be tight; but in practice, the approximations tend to be much closer to the correct answer than in the worst case. While some other algorithms based on frequent itemsets yield even better approximations, they are not as widely applicable.

The second question concerns extending the definition of frequent itemsets to relax the requirement of perfect co-occurrence: highly correlated items may form an interesting set, even if they never co-occur in a single record. The problem is to formalize this idea in a way that still admits efficient mining algorithms. Two different approaches are used. First, dense itemsets are defined in a manner similar to the usual frequent itemsets and can be found using a modification of the original itemset mining algorithm. Second, tiles are defined in a different way so as to form a model for the whole data, unlike frequent and dense itemsets. A heuristic algorithm based on spectral properties of the data is given and some of its properties are explored.

**Keywords:**
data mining, frequent itemset, query selectivity, Boolean
formulas, Bonferroni inequality, error-tolerant itemset

This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.

© 2006 Helsinki University of Technology

Last update 2011-05-26