Summarization - Compressing Data into an Informative Representation

Date of Submission: June 8, 2005
Report Number: 05-024
Abstract: 
Summarization is an important problem in many domains involving large datasets. It can be viewed as the transformation of data into a concise yet meaningful representation that can be used for efficient storage or manual inspection. In this paper, we formulate the summarization of a large dataset of transactions as an optimization problem involving two objective functions: compaction gain and information loss. We propose metrics to characterize the output of any summarization algorithm, and data mining techniques that obtain a summary for a given set of transactions while optimizing these two objective functions. We illustrate one application of summarization in the field of network data, where we show how our technique can be used to summarize network traffic into a meaningful representation. We first present a direct application of a standard clustering scheme to generate summaries. We then show how this can be significantly improved by a multi-step approach that generates candidate summaries for a dataset using association analysis and then chooses a subset of these candidates as the summary with the desired compaction and information content. We present results of experiments conducted on real and artificial datasets to demonstrate the effectiveness of our techniques.
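
To make the two objectives concrete, the following is a minimal sketch of how compaction gain and information loss might be computed for a toy set of network transactions. The abstract does not state the formal definitions, so everything here is an assumption made for illustration: a summary is taken to be a partial transaction whose omitted features act as wildcards, compaction gain is taken as the ratio of transactions to summaries, and information loss as the average fraction of feature values a transaction loses when replaced by its best covering summary. The function names and the sample data are hypothetical.

```python
from typing import Dict, List

Transaction = Dict[str, str]
# Assumption: a summary keeps a subset of features; omitted features are wildcards.
Summary = Dict[str, str]


def covers(summary: Summary, transaction: Transaction) -> bool:
    """True if every feature value kept by the summary matches the transaction."""
    return all(transaction.get(k) == v for k, v in summary.items())


def compaction_gain(transactions: List[Transaction], summaries: List[Summary]) -> float:
    """Assumed definition: number of transactions per summary."""
    return len(transactions) / len(summaries)


def information_loss(transactions: List[Transaction], summaries: List[Summary]) -> float:
    """Assumed definition: mean fraction of feature values dropped per transaction.

    Each transaction is represented by the covering summary that retains
    the most features.
    """
    total = 0.0
    for t in transactions:
        covering = [s for s in summaries if covers(s, t)]
        if not covering:
            raise ValueError(f"no summary covers transaction {t}")
        best = max(covering, key=len)
        total += (len(t) - len(best)) / len(t)
    return total / len(transactions)


# Hypothetical network flows (not data from the report).
flows = [
    {"src": "A", "dst": "X", "port": "80"},
    {"src": "A", "dst": "Y", "port": "80"},
    {"src": "B", "dst": "Z", "port": "22"},
]
candidate_summary = [
    {"src": "A", "port": "80"},               # merges the first two flows, dropping dst
    {"src": "B", "dst": "Z", "port": "22"},   # keeps the third flow intact
]

gain = compaction_gain(flows, candidate_summary)
loss = information_loss(flows, candidate_summary)
```

Under these assumed definitions the trade-off is visible directly: merging more flows into a single summary raises the gain but forces more features to be dropped, which raises the loss.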