Mining Spatial Data: The University of Minnesota Leads the Way


Front Row (left to right): Yan Huang, Wei-Hsin Fu, Shashi Shekhar, Wei-Li Wu;
Middle row: Xiaobin Ma, Hui Xiong, Chang-Tien Lu;
Back row: Ru-Lin (Alan) Liu, Pusheng (Alex) Zhang, Vatsavai Ranga Raju;
Not pictured: Caroline Peterson, Svetoslav Stoykov, Judy Djugash.

Spatial data mining software - code for pattern discovery processes based on geometric structures - is coming soon. When it arrives we will be able to efficiently identify spatial patterns in massive geo-spatial data sets for application domains including public safety (finding crime hot spots), public health (predicting the spread of disease), climatology (effects of El Niņo), ecology (protecting endangered species), transportation (detecting local instabilities in traffic), location-based ser-vices in the M(mobile)-commerce industry, and national defense (inferring enemy tactics such as flank attacks).

The spatial data mining software results from the research work of Professor Shashi Shekhar and his students. Their research has demonstrated that key assumptions of classical data mining techniques are invalid for geo-spatial data sets. Though classical data mining and spatial data mining share goals, their domains have different characteristics. First, spatial data is embedded in a continuous space, whereas classical data sets are often discrete. Second, spatial patterns are often local whereas classical data mining techniques often focus on global patterns. Finally, one of the common assumptions in classical statistical analysis is that data samples are independently generated. When it comes to the analysis of spatial data, however, the assumption about the independence of samples is generally false because spatial data tends to be highly self-correlated. For example, people with similar characteristics, occupation and background tend to cluster together in the same neighborhoods. In spatial statistics this tendency is called spatial auto-correlation. Ignoring spatial auto-correlation when analyzing data with spatial characteristics may produce hypotheses or models that are inaccurate or inconsistent with the data set. Thus classical data mining algorithms often perform poorly when applied to spatial data sets. New methods are needed to analyze spatial data to detect spatial patterns.

Roots of spatial data mining lie in spatial statistics, spatial analysis, geographic information systems, machine learning, image analysis, and data mining. Several other departments at the University of Minnesota such as Biostatistics, Forest Resources, Electrical and Computer Engineering, Epidemiology, Geography and Psychology as well as research centers such as the Army High Performance Computing Research Center, the Center for Transportation Studies, the Institute for Mathematics and Its Applications (IMA), the Center for Urban and Regional Affairs, Precision Agriculture, and the Cancer Center are contributing to the field. In fact, the IMA is organizing a workshop on spatio-temporal patterns in the geosciences during the last week of September 2001, and a series of workshops related to the mathematics of geoscience in 2001-2002.

In addition to Shekhar, Computer Science and Engineering faculty members interested in spatial data mining include Professors Dan Boley, Ravi Janardan, George Karypis, Vipin Kumar, Nikos Papanikolopoulos and Paul Schrater. Prominent alumni in this field include Jack Dangermond (President, Environmental Systems Research Institute), Dr. Raju Namburu (Army Research Lab), and Dr. Siva Ravada (Manager, Spatial Data Group, Oracle Corporation).

The main contributions made by Shekhar and his students to spatial data mining include algorithms and data structures that can scale up to massive (terabytes to petabytes) data sets and formalization of new spatio-temporal patterns (e.g. co-locations) which were not explored by other research communities due to high computational complexity. Specific contributions include discovering spatial co-locations, detecting spatial outliers and location prediction.

The co-location pattern discovery process finds frequently co-located subsets of spatial event types given a map of their locations (see Figure1). For example, analysis of habitats of animals and plants may identify co-location of predator-prey species, symbiotic species, and fire events with ignition sources. Readers may find it interesting to analyze the map in Figure 1 to find co-location patterns. There are two co-location patterns of size 2 in this map. Shekhar's group has provided one of the most natural formulations as well as the first algorithms for discovering co-location patterns from large spatial data sets and applying it to climatology data from NASA.

Spatial outliers are significantly different from their neighborhood even though they may not be significantly different from the entire population. For example, a brand new house in an old neighborhood of a growing metropolitan area is a spatial outlier. Figure 2 shows another use of spatial outliers in traffic measurements for sensors on I-35W (north bound) in the Twin Cities metro area for a 24 hour time period. Sensor 9 seems to be a spatial outlier and may be a bad sensor. Note that the figure also shows three clusters of sensor behaviors, morning rush hour, busy daytime, and evening rush hour. Spatial statistics tests for detecting spatial outliers do not scale up to massive data sets such as the Twin Cities traffic data set measured at thousands of locations in 30-second intervals and archived for years. Shekhar's group generalized spatial statistics tests to spatio-temporal data sets and developed scalable algorithms for detecting spatial outliers in massive traffic data sets. This work built on the traditional strength in designing scalable hierarchical routing algorithms and efficient storage methods for roadmaps.

Location prediction is concerned with discovering a model to infer locations of a Boolean spatial phenomenon from the maps of other spatial features. For example, ecologists build models to predict habitats for endangered species using maps of vegetation, water bodies, climate and other related species. Maps of nest location, vegetation and water were used to build a location prediction model for nests of red-winged blackbirds in wetlands on the shores of Lake Erie in Ohio. Classical data mining techniques yield weak prediction models as they do not capture the auto-correlation in spatial data sets. Shekhar's group provided a formal comparison of diverse techniques from spatial statistics (e.g. spatial auto-regression) as well as image classification (e.g. Markov random field based Bayesian classifiers) and developed scalable algorithms for those.

Courses on Scientific Databases (CSci 8705, Fall 2001), Data Mining (a CSci seminar), and Spatial Biostatistics (PubH 8436) are wonderful opportunities to learn more about these topics. Professor Shekhar's upcoming book on Spatial Databases (publisher Prentice Hall, 2002) as well as web sites (www.cs.umn.edu/research/shashi-group and db.cs.sfu.ca/GeoMiner/) archiving recent research publications on the topic are additional resources.

-Shashi Shekhar & Bobbie Othmer