NSF Grant IIS-0308264 Data Mining for Rare Class Analysis

Data Mining for Rare Class Analysis

National Science Foundation Award Number: IIS-0308264 (September 1, 2003 - August 31, 2007)



Contact Information:

Vipin Kumar, PI
Department of Computer Science and Engineering
4-192, EE/CSci Building
University of Minnesota
Minneapolis, MN 55455
Phone (612) 625 0726
E-mail: kumar at cs.umn.edu     URL: http://www.cs.umn.edu/~kumar

Jaideep Srivastava, co-PI
Department of Computer Science and Engineering
5-209, EE/CSci Building
University of Minnesota
Minneapolis, MN 55455
Phone (612) 625 4012
E-mail: srivasta at cs.umn.edu     URL: http://www.cs.umn.edu/people/faculty/index.php?id=165

List of Supported Students and Staff:

Postdoctoral Researcher(s):
Graduate Students:

Project Award Information:


Project Summary:

This project systematically addresses the rare class problem, which is an important issue in building predictive models. It involves developing novel methods for selecting features to build predictive models in the context of rare class learning. Specifically, a feature-based approach has been developed to find local patterns and features using association analysis. The project also involves developing new predictive modeling methods that are specifically suited for rare classes, as well as adaptive techniques for data streams, since the data in many rare class analysis applications arrives as a set of time-oriented streams. Techniques have also been developed for analyzing network traffic to detect scans aimed at identifying network vulnerabilities.

Research results from this project were highlighted twice (2004 and 2007) as NSF nuggets and were also highlighted on NSF's discovery webpage:
http://www.nsf.gov/discoveries/disc_summ.jsp?cntn_id=100488&org=NSF

Publications:

  1. San-Yih Hwang, Haojun Wang, Jaideep Srivastava, and Raymond A. Paul, A Probabilistic QoS Model and Computation Framework for Web Services-Based Workflows, Lecture Notes in Computer Science : Conceptual Modeling - ER 2004, pp. 596-609.
  2. M. Joshi and V. Kumar, CREDOS: Classification using Ripple Down Structure (A Case for Rare Classes), In Proceedings of the SIAM International Conference on Data Mining, Lake Buena Vista, FL, April 2004.
  3. Aleksandar Lazarevic, Ramdev Kanapady, Chandrika Kamath, Vipin Kumar, and Kumar Tamma, Effective localized regression for damage detection in large complex mechanical structures, Proceedings of the Tenth ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining (SIGKDD 2004), Seattle, USA, 2004, pp 450-459.
  4. Mane, S.; Srivastava, J.; San-Yih Hwang; Vayghan, J., Estimation of false negatives in classification, Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM 2004), pp. 475-478, 1-4 Nov. 2004.
  5. Aysel Ozgur, Pang-Ning Tan, and Vipin Kumar, RBA: An Integrated Framework for Regression Based on Association Rules, Proc. 2004 SIAM International Conf. on Data Mining (SDM’04), Florida, USA, 2004.
  6. Steinbach, M., Tan, P., and Kumar, V. 2004. Support envelopes: a technique for exploring the structure of association patterns. In Proceedings of the Tenth ACM SIGKDD international Conference on Knowledge Discovery and Data Mining (Seattle, WA, USA, August 22 - 25, 2004). KDD '04. ACM Press, New York, NY, 296-305. DOI= http://doi.acm.org/10.1145/1014052.1014086.
  7. Steinbach, M., Tan, P., Xiong, H., and Kumar, V. 2004. Generalizing the notion of support. In Proceedings of the Tenth ACM SIGKDD international Conference on Knowledge Discovery and Data Mining (Seattle, WA, USA, August 22 - 25, 2004). KDD '04. ACM Press, New York, NY, 689-694.
  8. Hui Xiong, Michael Steinbach, Pang-Ning Tan, and Vipin Kumar, HICAP: Hierarchical Clustering with Pattern Preservation, in Proc. 2004 SIAM International Conf. on Data Mining (SDM 2004), pp. 279 - 290, Florida, USA, 2004
  9. Chandola, V. and Kumar, V. 2005. Summarization — Compressing Data into an Informative Representation. In Proceedings of the Fifth IEEE international Conference on Data Mining (November 27 - 30, 2005). ICDM. IEEE Computer Society, Washington, DC, 98-105.
  10. Vipin Kumar, Parallel and Distributed Computing for Cyber Security. An article based on the keynote talk by the author at 17th International Conference on Parallel and Distributed Computing Systems (PDCS-2004). DS Online Journal, Volume 6, number 10, October 2005.
  11. A. Lazarevic, V. Kumar, and J. Srivastava, A Survey of Intrusion Detection Systems, in Managing Cyber Threats: Issues, Approaches and Challenges, edited by V. Kumar, J. Srivastava, and A. Lazarevic, Kluwer Academic Publishers, May 2005.
  12. Aleksandar Lazarevic and Vipin Kumar, Feature Bagging for Outlier Detection, Proceedings of the Eleventh ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining (SIGKDD 2005) Chicago, August, 2005.
  13. Sandeep Mane, Jamshid Vayghan, Jaideep Srivastava, Philip Yu and Gedas Adomavicius, Data Mining Techniques for Automated Evaluation of Sales Opportunities: A Case Study, in International Workshop on Customer Relationship Management: Data Mining Meets Marketing, 2005.
  14. Mane, S., Srivastava, J., and Hwang, S. Estimating missed actual positives using independent classifiers. In Proceedings of the Eleventh ACM SIGKDD international Conference on Knowledge Discovery in Data Mining (Chicago, Illinois, USA, August 21 - 24, 2005). KDD '05. ACM Press, New York, NY, 648-653, 2005.
  15. J. Srivastava, P. Desikan, and V. Kumar, Web Mining – Concepts, Applications and Research Directions, Book Chapter in Recent Advances in Data Mining and Granular Computing (mathematical aspects of knowledge discovery), T.Y. Lin and Wesley Chu, eds., Springer-Verlag, expected 2005.
  16. Michael Steinbach and Vipin Kumar, Generalizing the Notion of Confidence, Fifth IEEE International Conference on Data Mining (ICDM' 05), pp 402-409, Houston, TX, 27-30 November, 2005. Also to appear in Knowledge and Information Systems.
  17. Hui Xiong, X. He, Chris Ding, Ya Zhang, Vipin Kumar, Stephen R. Holbrook, Identification of Functional Modules in Protein Complexes via Hyperclique Pattern Discovery, in Proc. of the Pacific Symposium on Biocomputing, (PSB 2005), pp. 221-232, 2005.
  18. Jieping Ye, Qi Li, Hui Xiong, Haesun Park, Ravi Janardan, Vipin Kumar, IDR/QR: An Incremental Dimension Reduction Algorithm via QR Decomposition, IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 9, pp. 1208-1222, September, 2005.
  19. S. Mane, P. Desikan, J. Srivastava, From Clicks to Bricks: CRM Lessons from E-commerce, 2nd Annual Statistical Challenges in Electronic Commerce Research Symposium, May 2006
  20. Gyorgy Simon, Eric Eilertson, Vipin Kumar, Zhi-Li Zhang and Hui Xiong, Scan Detection: A Data Mining Approach, Proceedings of the Sixth SIAM International Conference on Data Mining, April 20-22, 2006, Bethesda, MD
  21. Mark Shaneck, Yongdae Kim, Vipin Kumar, Privacy Preserving Nearest Neighbor Search, To appear in the 2006 IEEE International Workshop on Privacy Aspects of Data Mining, December, 2006
  22. Gyorgy J. Simon, Vipin Kumar, and Zhi-Li Zhang, Estimating False Negatives for Classification Problems with Cluster Structure, University of Minnesota Technical Report, October, 2006.
  23. Hui Xiong, Gaurav Pandey, Michael Steinbach, Vipin Kumar, Enhancing Data Analysis with Noise Removal, IEEE Transactions on Knowledge and Data Engineering (TKDE), vol. 18, no. 3, pp. 304-319, March, 2006.
  24. Hui Xiong, Shashi Shekhar, Pang-Ning Tan, and Vipin Kumar, TAPER: A Two-Step Approach for All-strong-pairs Correlation Query in Large Databases, IEEE Transactions on Knowledge and Data Engineering (TKDE), Volume 18, Number 4, pp. 493-508, April 2006.
  25. Hui Xiong, Michael Steinbach, Arafin Ruslim, Vipin Kumar, Characterizing Pattern Preserving Clustering, Submitted to ACM Transactions on Information Systems, 2006.
  26. Hui Xiong, Michael Steinbach, and Vipin Kumar, Privacy Leakage in Multi-relational Databases: A Semi-supervised Learning Perspective, VLDB Journal Special Issue on Privacy Preserving Data Management, Vol. 15, No. 4, pp. 388-402, November, 2006.
  27. Xiong, H., Tan, P., and Kumar, V. 2006. Hyperclique pattern discovery. Data Min. Knowl. Discov. 13, 2 (Sep. 2006), 219-242.
  28. Varun Chandola, Shyam Boriah, and Vipin Kumar. Similarity Measures for Categorical Data--A Comparative Study. University of Minnesota Technical Report 07-022, October, 2007.
  29. Varun Chandola, Arindam Banerjee, and Vipin Kumar. Outlier Detection - A Survey. University of Minnesota Technical Report 07-017, August, 2007.
  30. Sandeep Mane and Jaideep Srivastava, False Negative Estimation and Feature Subsets Selection, submitted to IEEE TKDE in Dec, 2007.
  31. Gyorgy Simon, Vipin Kumar, and Zhi-Li Zhang, Estimating False Negatives for Classification Problems with Cluster Structure, University of Minnesota Technical Report, 2007. Also published in Proceedings of the Seventh SIAM International Conference on Data Mining, April 26-28, 2007, Minneapolis, MN

Research Contributions:

New algorithms and techniques have been developed for the classification of rare classes, and these techniques have been applied in a number of areas. Specific techniques include hierarchical classification models, local regression (RBA), outlier detection techniques, feature bagging, the hyperclique association pattern finding algorithm, methods for estimating true and false positives, and scan detection algorithms. Some of the work on scan detection has been incorporated into the Minnesota Intrusion Detection System (MINDS), which is the subject of a National Science Foundation (NSF) news clip.

Contributions to Resources for Research and Education:

  1. Kumar has co-authored the following introductory textbook on data mining:
    Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Addison-Wesley, ISBN: 0321321367, April 2005.
    The book appeared in print in Spring 2005 and has since been adopted extensively worldwide, including at major universities such as Stanford, the University of Texas at Austin, and UIUC. It has been translated, or is being translated, into several languages, including Chinese, Portuguese, and Greek.
  2. Kumar and Srivastava have co-edited the following book on cyber security:
    Managing Cyber Threats: Issues, Approaches and Challenges, edited by V. Kumar, J. Srivastava, and A. Lazarevic, Springer, ISBN 0-387-24226-0, May 2005.
  3. Kumar, Srivastava, and Lazarevic have co-authored the following survey:
    A. Lazarevic, V. Kumar, and J. Srivastava, A Survey of Intrusion Detection Systems, in Managing Cyber Threats: Issues, Approaches and Challenges, edited by V. Kumar, J. Srivastava, and A. Lazarevic, Kluwer Academic Publishers, May 2005.
  4. Kumar has co-authored the following survey on protein function prediction:
    Gaurav Pandey, Vipin Kumar, and Michael Steinbach. Computational approaches for protein function prediction: A Survey, Technical Report 06-028, Department of Computer Science, University of Minnesota, October 2006. (Submitted to ACM Computing Surveys.)
    Rare classes are an important issue for protein function prediction since functional classes are often quite imbalanced.