General Purpose Methods for Unsupervised Exploration of Large Datasets

Daniel Boley

Computer Science and Engineering
University of Minnesota

Contact Information

Daniel Boley - PI
Computer Science and Engineering
University of Minnesota
4-192 EE/CS, 200 Union St. SE,
Minneapolis, MN 55455
Phone: (612) 625-3887
Fax: (612) 625-0572
Email: my last name at cs.umn.edu

WWW PAGE

List of Supported Students and Staff (optional)

Graduate student(s) : David Littau;; Dongwei Cao

Project Award Information

Keywords

Unsupervised Clustering, Scalable algorithms, Sparse data representation, Tree structure.

Project Summary

The main goals of this project can be divided into (1) theoretical developments, (2) development of agents and interfaces, and (3) extension to a variety of application domains. We have begun activities in each of these areas.

We are in the process of carrying out a theoretical and algorithmic development of a version of Principal Direction Divisive Partitioning (PDDP) to datasets which are too large to fit in main memory at once. For this activity we avoid the simplistic approach of simply replacing the data samples with a small number of "representatives." Such representatives might, for example, be computed via a standard clustering algorithm, and each representative would represent many original data items. Instead we maintain an individual representation for each separate original data sample, but which takes substantially less space than the original data. As a consequence, when the entire dataset is clustered, data samples are not necessarily locked with their neighbors sharing the same representatives. Hence outliers can be handled in a more clean way. The paradigm is easily adapted to data streams, with fast updating as new data arrives, and optionally as old data is purged, and it enjoys a subtantial amount of parallelism.

Under the heading of agent development, we are continuing a project of integrating the clustering process with a collection of web browser tools to facilitate the navigation through a large number of web pages.

Under the heading of new application areas, we are applying a variety of classification and clustering methods to problems in the visual identification of human activities. Using techniques such as clustering (e.g. PDDP), classification using kernel methods (e.g. Support Vector Machines (SVMs)), and dimensionality reduction methods (e.g. Principal Component Analysis (PCA)), we use sequences of visual images to recognize human activities, such as walking and running. This project, joint with the Univ. of Minnesota Artificial Intelligence and Robotics Laboratory, has applications in Homeland Security as well as theoretical development of machine intelligence.

Publications and Products

Software

Project Impact

Goals, Objectives, and Targeted Activities

Discoveries at and across the frontier of science and engineering: We have developed a new paradigm for partitioning large datasets and the conceptual framework to incorporate new features. This framework has been made possible only by a synergy of a variety of different scientific disciplines, with the result that methods have new capabilities not possible by using methods from only one scientific discipline. Examples have included the predictive toxicology of organic molecules (joint with Inst. Mario Negri (Milan, Italy), automated organization of alcohol-related bills for the MN School of Public Health.

Project References

Area Background

This project depends on the combination of several widely different fields which have come together to produce methods which are extremely successful. The foundation of the techniques -- the ideas that lead to the practicality and the scalability -- are methods for large sparse eigenvalue problems in numerical linear algebra. Specifically, the core computation is a so-called Lanczos-based method for computing the leading principal component for a large collection of sparse vectors each of which represents a document (or other data sample). The methods preserve the inherent sparsity making it possible to represent large data collections in relatively little space, and also making it possible to carry out the computations in a relatively short time. These underlying methods serve as a foundation for a variety of divisive partitioning techniques, which are analogous to the classical agglomerative clustering techniques. Instead of coalescing the individual data samples one by one into clusters (e.g. with a nearest neighbor criterion), we start with all the data samples in one big cluster and split it repeatedly along the principal component. The splitting process yields weights indicating which features are most critical in placing data samples in one cluster as opposed to another cluster. The utility of these indications is currently underdeveloped.

Area References