Document Categorization and Query Generation on the World Wide Web Using WebACE

Date of Submission: 
January 30, 1998
Report Number: 
98-006
Abstract: 
We present WebACE, an agent for exploring and categorizing documents on the World Wide Web based on a user profile. The heart of the agent is an unsupervised categorization of a set of documents, combined with a process for generating new queries that is used to search for new related documents and for filtering the resulting documents to extract the ones most closely related to the starting set. The document categories are not given a priori. We present the overall architecture and describe two novel algorithms which provide significant improvement over traditional clustering algorithms and form the basis for the query generation and search component of the agent. We report on the results of our experiments comparing these new algorithms with more traditional clustering algorithms and we show that our algorithms are fast and scalable.