The doctoral dissertations of the former Helsinki University of Technology (TKK) and Aalto University Schools of Technology (CHEM, ELEC, ENG, SCI) published in electronic format are available in the electronic publications archive of Aalto University - Aaltodoc.
Aalto

Algorithms for Finding Orders and Analyzing Sets of Chains

Antti Ukkonen

Dissertation for the degree of Doctor of Science in Technology to be presented with due permission of the Faculty of Information and Natural Sciences for public examination and debate in Auditorium TU1 at Helsinki University of Technology (Espoo, Finland) on the 4th of June, 2008, at 12 noon.

Dissertation in PDF format (ISBN 978-951-22-9374-2)   [2036 KB]
Dissertation is also available in print (ISBN 978-951-22-9373-5)

Abstract

Rankings of items are a useful concept in a variety of applications, such as clickstream analysis, some voting methods, bioinformatics, and other fields of science such as paleontology. This thesis addresses two problems related to such data. The first problem is about finding orders, while the second one is about analyzing sets of orders.

We address two different tasks in the problem of finding orders. We can find orders either by computing an aggregate of a set of known orders, or by constructing an order for a previously unordered data set. For the first task we show that bucket orders, a subclass of partial orders, are a useful structure for summarizing sets of orders. We formulate an optimization problem for finding such partial orders, show that it is NP-hard, and give an efficient randomized algorithm for finding approximate solutions to it. Moreover, we show that the expected cost of a solution found by the randomized algorithm differs from the optimal solution only by a constant factor. For the second approach we propose a simple method for sampling orders for 0–1 vectors that is based on the consecutive ones property.

For analyzing orders, we discuss three different methods. First, we give an algorithm for clustering sets of orders. The algorithm is a variant of Lloyd's iteration for solving the k-means problem. We also give two different approaches for mapping orders to vectors in a high-dimensional Euclidean space. These mappings are used on one hand for clustering, and on the other hand for creating two dimensional visualizations (scatterplots) for sets of orders. Finally, we discuss randomization testing in case of orders. To this end we propose an MCMC algorithm for creating random sets of orders that preserve certain well defined properties of a given set of orders. The random data sets can be used to assess the statistical significance of the results obtained e.g. by clustering.

Keywords: data analysis, ranking, approximation algorithm, clustering, visualization, dimensionality reduction, MCMC, randomization testing

This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.

© 2008 Helsinki University of Technology


Last update 2011-05-26