TITLE:

Spatial Data Mining and Environmental Sciences

PRESENTER:

Shashi Shekhar

AFFILIATION:

Computer Science Department, University of Minnesota.

URL:

http://www.cs.umn.edu/~shekhar

ABSTRACT:

It is critical to monitor and predict where and when large contaminant fluxes will occur so that actions may be taken to protect the environment and limit the exposure of human and aquatic life. Current water quality monitoring is based on infrequent (e.g., weekly) sampling and time-consuming (e.g., hours to days) testing methods, making it difficult to make timely decisions to protect watersheds, a crucial part of our environment. Recent advances have led to the use of sensor-based monitoring networks, which provide increased sampling frequency, and of digital watershed data warehouses to manage the sensor data. However, key challenges remain. Researchers need new models that take full advantage of these new types of data sets, as current models do not adequately account for the huge quantities of data collected and the new patterns that become observable as a result. The goal of this project is to advance new scalable spatio-temporal data mining tools and protocols for monitoring, detecting, and predicting contamination of the environment. Classical and spatial data mining ideas are generalized to represent and analyze data sets related to physical processes such as water flow and contaminant flow, using novel methods such as flow anomaly detection and teleconnection detection.

Given a percentage threshold and readings from a pair of consecutive upstream and downstream sensors, flow anomaly discovery identifies dominant time intervals in which the fraction of time instants with significantly mismatched sensor readings exceeds the given threshold. Discovering flow anomalies (FAs) is an important problem in environmental flow monitoring networks and in early warning detection systems for water quality problems. However, mining FAs is computationally expensive because of the large (potentially infinite) number of time instants of measurement and the potentially long delays caused by stagnant (e.g., lakes) or slow-moving (e.g., wetlands) water bodies between consecutive sensors. Traditional outlier detection methods (e.g., t-test) are suited to detecting transient FAs (i.e., time instants of significant mismatch across consecutive sensors) but cannot detect persistent FAs (i.e., long, variable-length time windows with a high fraction of transient FAs), because persistent FAs have no pre-defined window size. In contrast, we propose a Smart Window Enumeration and Evaluation of persistence-Thresholds (SWEET) method to efficiently explore the search space of all possible window lengths. Computational overhead is reduced significantly by restricting the start and end points of a window to coincide with transient FAs, and by using a smart counter and efficient pruning techniques. Experimental evaluation on a real dataset shows that the proposed approach outperforms naive alternatives.
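
To make the problem statement above concrete, the following is a minimal sketch of a naive baseline, not the SWEET method itself. It enumerates only windows whose start and end points are transient FAs, as described above, and uses a prefix-sum counter to evaluate each window in constant time. Several details are assumptions made for illustration and do not come from the publications: the function names, the fixed travel delay between sensors, the absolute-difference tolerance used in place of a statistical test, and the reading of "dominant" as "not contained in any other qualifying window".

```python
from itertools import accumulate


def transient_flags(upstream, downstream, delay, tolerance):
    """Flag time instants that look like transient flow anomalies.

    Assumption: an upstream reading at time t is compared with the downstream
    reading at time t + delay, and a mismatch is any absolute difference
    larger than `tolerance`. A statistical test could be substituted here.
    """
    usable = min(len(upstream), len(downstream) - delay)
    return [abs(upstream[t] - downstream[t + delay]) > tolerance
            for t in range(usable)]


def persistent_flow_anomalies(flags, threshold):
    """Return dominant windows (start, end), both inclusive.

    A window qualifies when its fraction of flagged instants is strictly
    greater than `threshold`. Only windows that start and end on a flagged
    instant are enumerated.
    """
    # prefix[i] holds the number of flagged instants in flags[:i].
    prefix = [0] + list(accumulate(int(f) for f in flags))
    marked = [i for i, f in enumerate(flags) if f]

    candidates = []
    for a, start in enumerate(marked):
        for end in marked[a:]:
            flagged = prefix[end + 1] - prefix[start]
            if flagged / (end - start + 1) > threshold:
                candidates.append((start, end))

    # Assumed meaning of "dominant": drop any window that lies inside
    # another qualifying window.
    return [w for w in candidates
            if not any(o != w and o[0] <= w[0] and w[1] <= o[1]
                       for o in candidates)]


if __name__ == "__main__":
    flags = [False, True, True, False, True, False, False, False, True]
    print(persistent_flow_anomalies(flags, threshold=0.6))
```

With k transient FAs this sketch evaluates on the order of k^2 windows and then filters them pairwise, so it is only practical for short series. It is the kind of naive alternative that pruning is meant to improve on.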

KEYWORDS: Spatial Datasets, Auto-correlation, Spatial data mining.

ACKNOWLEDGMENTS: This work was supported in part by the National Science Foundation, the U.S. Department of Defense, and the University of Minnesota.

NOTE: Some of the results discussed in this talk appeared in the following publications:

  1. James M. Kang, Shashi Shekhar, Christine Wennen, Paige Novak, Discovering Flow Anomalies: A SWEET Approach, Proc. Eighth IEEE International Conference on Data Mining (ICDM), pp. 851-856, 2008.
  2. James M. Kang, Shashi Shekhar, Michael Henjum, Paige J. Novak, William A. Arnold, Discovering Teleconnected Flow Anomalies: A Relationship Analysis of Dynamic Neighborhoods (RAD) Approach, Proc. Intl. Symp. on Adv. in Spatial and Temporal Databases, Springer LNCS 5644, 2009.