Hierarchical Taxonomies using Divisive Partitioning

Date of Submission: 
March 17, 1998
Report Number: 
98-012
Abstract: 
We propose an unsupervised divisive partitioning algorithm for document data sets which enjoys many favorable properties. In particular, the algorithm shows excellent scalability to large data collections and produces high quality clusters which are competitive with other clustering methods. The algorithm yields information on the significant and distinctive words within each cluster, and these words can be inserted into the naturally occuring hierarchical structure produced by the algorithm. The result is an automatically generated hierarchical topical taxonomy of a document set. In this paper, we show how the algorithm's cost scales up lineraly with the size of the data, illustrate experimentally the quality of the clusters produced, and show how the algorithm can produce a hierarchical topical taxonomy.