The doctoral dissertations of the former Helsinki University of Technology (TKK) and Aalto University Schools of Technology (CHEM, ELEC, ENG, SCI) published in electronic format are available in the electronic publications archive of Aalto University - Aaltodoc.

Text Mining with the WEBSOM

Krista Lagus

Dissertation for the degree of Doctor of Science in Technology to be presented with due permission of the Department of Computer Science and Engineering for public examination and debate in Auditorium T2 at Helsinki University of Technology (Espoo, Finland) on the 11th of December, 2000, at 12 o'clock noon.

Overview in PDF format (ISBN 951-22-5260-0) [1085 KB]
Dissertation is also available in print (ISBN 951-666-556-X)

Abstract

The emerging field of text mining applies methods from data mining and exploratory data analysis to analyzing text collections and to conveying information to the user in an intuitive manner. Visual, map-like displays provide a powerful and fast medium for portraying information about large collections of text. Relationships between text items and collections, such as similarity, clusters, gaps, and outliers, can be communicated naturally using spatial relationships, shading, and colors.

In the WEBSOM method, the self-organizing map (SOM) algorithm is used to automatically organize very large and high-dimensional collections of text documents onto two-dimensional map displays. The map forms a document landscape where similar documents appear close to each other at points of the regular map grid. The landscape can be labeled with automatically identified descriptive words that convey properties of each area and also act as landmarks during exploration. With the help of an HTML-based interactive tool, the ordered landscape can be used for browsing the document collection and for performing searches on the map.
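As an illustration of the idea described above, the following is a minimal sketch of a self-organizing map that places toy documents on a small grid. It is not the WEBSOM implementation: the vocabulary, the example documents, the 3 x 3 grid, the word-count encoding, and the linearly decaying learning rate and neighborhood radius are all assumptions chosen for brevity, and the function names are hypothetical.

```python
import math
import random


def bag_of_words(text, vocabulary):
    """Return a length-normalized word-count vector for `text`."""
    words = text.lower().split()
    counts = [float(words.count(term)) for term in vocabulary]
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]


def best_matching_unit(weights, vector):
    """Return the grid position whose model vector is closest to `vector`."""
    def sq_dist(pos):
        return sum((w - v) ** 2 for w, v in zip(weights[pos], vector))
    return min(weights, key=sq_dist)


def train_som(vectors, rows, cols, epochs=50, seed=0):
    """Train a small SOM and return {(row, col): model_vector}."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    total_steps = epochs * len(vectors)
    step = 0
    for _ in range(epochs):
        # Present the documents in a fresh random order each epoch.
        for vector in rng.sample(vectors, len(vectors)):
            frac = step / total_steps
            alpha = 0.5 * (1.0 - frac)  # learning rate decays linearly (assumed schedule)
            # Neighborhood radius shrinks linearly, but never below 0.5.
            sigma = max(0.5, (max(rows, cols) / 2.0) * (1.0 - frac))
            br, bc = best_matching_unit(weights, vector)
            for (r, c), w in weights.items():
                grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                h = alpha * math.exp(-grid_d2 / (2.0 * sigma ** 2))
                # Move the unit's model vector toward the document vector.
                for i in range(dim):
                    w[i] += h * (vector[i] - w[i])
            step += 1
    return weights


# Hypothetical toy collection with two themes.
VOCAB = ["neural", "network", "learning", "football", "goal", "match"]
DOCS = [
    "neural network learning",
    "learning neural network network",
    "football match goal",
    "goal goal football match",
]

vectors = [bag_of_words(doc, VOCAB) for doc in DOCS]
som = train_som(vectors, rows=3, cols=3)
positions = [best_matching_unit(som, v) for v in vectors]

for doc, pos in zip(DOCS, positions):
    print(pos, doc)
```

Each printed grid position is the map point at which a document lands. Documents with similar word content would be expected to land on the same or nearby units, which is what makes the grid usable as a landscape for browsing.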

An organized map offers an overview of an unknown document collection, helping the user to familiarize herself with the domain. Map displays that are already familiar can be used as visual frames of reference for conveying properties of unknown text items. Static, thematically arranged document landscapes provide meaningful backgrounds for dynamic visualizations of, for example, time-related properties of the data. Search results can be visualized in the context of related documents.

Experiments on document collections of various sizes, text types, and languages show that the WEBSOM method is scalable and generally applicable. Preliminary results in a text retrieval experiment indicate that, even when the additional value provided by the visualization is disregarded, the document maps perform at least comparably with more conventional retrieval methods.
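One way a document map can support retrieval is sketched below: a query vector is compared against the model vector of each map unit, and the documents attached to the best-matching units are returned in order of similarity to the query. This is an assumption-laden illustration and not necessarily the procedure evaluated in the thesis. The model vectors, document vectors, unit assignments, cosine similarity measure, and the function names `cosine` and `retrieve` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_vec, unit_models, unit_docs, doc_vecs, n_units=1):
    """Return ids of documents on the n_units best-matching map units,
    ranked by their similarity to the query."""
    ranked_units = sorted(unit_models,
                          key=lambda u: cosine(query_vec, unit_models[u]),
                          reverse=True)
    candidates = [d for u in ranked_units[:n_units] for d in unit_docs.get(u, [])]
    return sorted(candidates,
                  key=lambda d: cosine(query_vec, doc_vecs[d]),
                  reverse=True)


# Hypothetical, hand-set map with three units and one document per unit.
unit_models = {(0, 0): [1.0, 0.1, 0.0], (0, 1): [0.7, 0.7, 0.0], (0, 2): [0.0, 0.0, 1.0]}
unit_docs = {(0, 0): ["d1"], (0, 1): ["d2"], (0, 2): ["d3"]}
doc_vecs = {"d1": [1.0, 0.2, 0.0], "d2": [0.8, 0.6, 0.0], "d3": [0.0, 0.1, 1.0]}

query = [1.0, 0.0, 0.0]
print(retrieve(query, unit_models, unit_docs, doc_vecs, n_units=2))
```

Restricting the search to a few map units means only a fraction of the collection is compared with the query directly, at the cost of missing relevant documents that were mapped elsewhere.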

This thesis consists of an overview and of the following 8 publications:

  1. Lagus, K., Kaski, S., Honkela, T., and Kohonen, T. (1996). Browsing digital libraries with the aid of self-organizing maps. Proceedings of the Fifth International World Wide Web Conference WWW5, May 6-10, Paris, France, pp. 71-79. © 1996 authors.
  2. Lagus, K., Honkela, T., Kaski, S., and Kohonen, T. (1996). Self-organizing maps of document collections: a new approach to interactive exploration. In Simoudis, E., Han, J., and Fayyad, U., editors, Proceedings of the Second International Conference on Knowledge Discovery & Data Mining (KDD'96), pp. 238-243. AAAI Press, Menlo Park, CA. © 1996 AAAI. Reprinted with permission.
  3. Lagus, K. (1998). Generalizability of the WEBSOM method to document collections of various types. In Proceedings of 6th European Congress on Intelligent Techniques & Soft Computing (EUFIT'98), vol. 1, pp. 210-214, Verlag Mainz, Aachen, Germany. © 1998 authors.
  4. Kaski, S., Honkela, T., Lagus, K., and Kohonen, T. (1998). WEBSOM – self-organizing maps of document collections. Neurocomputing, vol. 21, pp. 101-117. © 1998 Elsevier Science. Reprinted with permission.
  5. Lagus, K. and Kaski, S. (1999). Keyword selection method for characterizing text document maps. In Proceedings of the Ninth International Conference on Artificial Neural Networks (ICANN'99), vol. 1, pp. 371-376. IEE Press, London. © 1999 IEE. Reprinted with permission.
  6. Lagus, K., Honkela, T., Kaski, S., and Kohonen, T. (1999). WEBSOM for textual data mining. Artificial Intelligence Review, vol. 13, issue 5/6, pp. 345-364. © 1999 Kluwer Academic Publishers. Reprinted with permission.
  7. Kohonen, T., Kaski, S., Lagus, K., Salojärvi, J., Honkela, J., Paatero, V., and Saarela, A. (2000). Self organization of a massive text document collection. IEEE Transactions on Neural Networks, Special Issue on Neural Networks for Data Mining and Knowledge Discovery, vol. 11, pp. 574-585. © 2000 IEEE. Reprinted with permission.
    Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
  8. Lagus, K. (2000). Text retrieval using self-organized document maps. Technical Report A61, Helsinki University of Technology, Laboratory of Computer and Information Science. ISBN 951-22-5145-0. © 2000 author.

Keywords: self-organizing map, document maps, visual user interfaces, information exploration, text retrieval, large text collections

This publication is copyrighted. You may download, display, and print it for your own personal use. Commercial use is prohibited.

© 2000 Helsinki University of Technology


Last update 2011-05-26