Using a Low-Memory Factored Representation to Data Mine Large Data Sets

Date of Submission: 
May 3, 2006
Report Number: 
06-015
Report PDF: 
Abstract: 
As data sets have continued to grow in size, it has become more difficult to mine them efficiently. A data set may be dynamic, or may be static but very large. The data could be arriving in a stream, such that the total number of data points is finite and unknown or infinite. Many data mining methods require multiple passes over the data, which means that the data must reside in memory to keep costs low. One alternative to mining the original data is to create a representation of the data set which occupies less memory, while permitting the incorporation of new data. To this end, we present the Low-Memory Factored Representation (LMFR). The LMFR is an alternate representation of the data which uses less memory, while maintaining a unique representation for each data point. Since the LMFR provides a general representation of the data, the data mining task does not need to be determined before the LMFR is constructed. For streaming data, the LMFR allows for many more data points to be exposed a given method at once. Furthermore, the LMFR can be re-computed on the fly to lower its memory footprint without re-examining the original data. We demonstrate experimentally that the LMFR can be an efficient replacement for the original data when clustering using Principal Direction Divisive Partitioning (PDDP). This new clustering method, Piecemeal PDDP (PMPDDP), maintains the scalability and quality of a PDDP clustering while extending PDDP to much larger data sets. We demonstrate experimentally that the LMFR can be used for document retrieval. A given LMFR can achieve retrieval accuracy superior to the truncated singular value decomposition of a given rank, while being less expensive to compute and occupying less memory. We also develop an application of the LMFR to streaming data mining, and show experimentally the cost of various techniques which lower the memory footprint of the LMFR. The overall results indicate that the LMFR has many possible applications in general data mining tasks.