A Segment-based Approach To Clustering Multi-Topic Documents

Date of Submission: January 31, 2008
Document clustering is a central problem in text data management, and it becomes particularly challenging when documents cover multiple topics. In this paper we address multi-topic document clustering by leveraging the natural composition of documents into text segments, each of which may bear one or more topics of its own. We propose a segment-based document clustering framework that induces a classification of documents by first identifying cohesive groups of segments drawn from the original documents. We provide empirical evidence of the significance of our approach on different large collections of multi-topic documents.
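
The general idea of segment-based clustering can be illustrated with a minimal sketch. The abstract does not specify how documents are segmented, how segments are represented, or which clustering algorithm is used, so everything below is an assumption made for illustration: paragraphs (separated by blank lines) stand in for segments, bag-of-words vectors with cosine similarity stand in for the representation, and a greedy single-pass procedure stands in for the clustering step. All function names and the `threshold` parameter are hypothetical and are not taken from the report.

```python
import math
import re
from collections import Counter


def split_into_segments(document):
    # Assumption: blank lines separate segments (paragraphs).
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]


def bag_of_words(text):
    # Assumption: lowercase alphabetic tokens with raw term counts.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    # Cosine similarity between two term-count vectors.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_segments(vectors, threshold=0.3):
    # Greedy single-pass clustering (a stand-in, not the report's method):
    # each segment joins the most similar cluster, or starts a new one.
    centroids, labels = [], []
    for vec in vectors:
        scores = [cosine(vec, c) for c in centroids]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= threshold:
            # Summed counts act as an unnormalized centroid for cosine.
            centroids[best] = centroids[best] + vec
            labels.append(best)
        else:
            centroids.append(Counter(vec))
            labels.append(len(centroids) - 1)
    return labels


def cluster_documents(documents, threshold=0.3):
    # Cluster segments first, then map each document to the set of
    # segment clusters it contributes to (a document may get several).
    owners, vectors = [], []
    for doc_id, doc in enumerate(documents):
        for segment in split_into_segments(doc):
            owners.append(doc_id)
            vectors.append(bag_of_words(segment))
    labels = cluster_segments(vectors, threshold)
    doc_clusters = {doc_id: set() for doc_id in range(len(documents))}
    for doc_id, label in zip(owners, labels):
        doc_clusters[doc_id].add(label)
    return doc_clusters
```

Under these assumptions, a two-paragraph document about both finance and sport would be assigned to two clusters, while single-topic documents would each fall into one. Because the document-level grouping is derived from segment-level groups, a multi-topic document is not forced into a single cluster.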