General Purpose Methods for Unsupervised Exploration of Large Datasets
Daniel Boley
Computer Science and Engineering
University of Minnesota
Contact Information
| Computer Science and Engineering University of Minnesota 4-192 EE/CS, 200 Union St. SE, Minneapolis, MN 55455 |
Phone: (612) 625-3887 Fax: (612) 625-0572 Email: my last name at cs.umn.edu |
WWW PAGE
List of Supported Students and
Staff (optional)
Graduate student(s) : David Littau;; Dongwei Cao
Project Award Information
Keywords
Unsupervised Clustering, Scalable algorithms, Sparse data representation, Tree structure.
Project Summary
We are in the process of carrying out a theoretical and algorithmic development of a version of Principal Direction Divisive Partitioning (PDDP) to datasets which are too large to fit in main memory at once. For this activity we avoid the simplistic approach of simply replacing the data samples with a small number of "representatives." Such representatives might, for example, be computed via a standard clustering algorithm, and each representative would represent many original data items. Instead we maintain an individual representation for each separate original data sample, but which takes substantially less space than the original data. As a consequence, when the entire dataset is clustered, data samples are not necessarily locked with their neighbors sharing the same representatives. Hence outliers can be handled in a more clean way. The paradigm is easily adapted to data streams, with fast updating as new data arrives, and optionally as old data is purged, and it enjoys a subtantial amount of parallelism.
Under the heading of agent development, we are continuing a project of integrating the clustering process with a collection of web browser tools to facilitate the navigation through a large number of web pages.
Under the heading of new application areas, we are applying a variety of classification and clustering methods to problems in the visual identification of human activities. Using techniques such as clustering (e.g. PDDP), classification using kernel methods (e.g. Support Vector Machines (SVMs)), and dimensionality reduction methods (e.g. Principal Component Analysis (PCA)), we use sequences of visual images to recognize human activities, such as walking and running. This project, joint with the Univ. of Minnesota Artificial Intelligence and Robotics Laboratory, has applications in Homeland Security as well as theoretical development of machine intelligence.
Publications and Products
Project Impact
Goals, Objectives, and Targeted Activities
Discoveries at and across the frontier of science and engineering: We have developed a new paradigm for partitioning large datasets and the conceptual framework to incorporate new features. This framework has been made possible only by a synergy of a variety of different scientific disciplines, with the result that methods have new capabilities not possible by using methods from only one scientific discipline. Examples have included the predictive toxicology of organic molecules (joint with Inst. Mario Negri (Milan, Italy), automated organization of alcohol-related bills for the MN School of Public Health.
Project References
Area Background
This project depends on the combination of several widely different fields which have come together to produce methods which are extremely successful. The foundation of the techniques -- the ideas that lead to the practicality and the scalability -- are methods for large sparse eigenvalue problems in numerical linear algebra. Specifically, the core computation is a so-called Lanczos-based method for computing the leading principal component for a large collection of sparse vectors each of which represents a document (or other data sample). The methods preserve the inherent sparsity making it possible to represent large data collections in relatively little space, and also making it possible to carry out the computations in a relatively short time. These underlying methods serve as a foundation for a variety of divisive partitioning techniques, which are analogous to the classical agglomerative clustering techniques. Instead of coalescing the individual data samples one by one into clusters (e.g. with a nearest neighbor criterion), we start with all the data samples in one big cluster and split it repeatedly along the principal component. The splitting process yields weights indicating which features are most critical in placing data samples in one cluster as opposed to another cluster. The utility of these indications is currently underdeveloped.
Area References