Configurable Web Crawler Homepage

  • The Configurable Web Crawler is a Perl web crawler for retrieving text information from sites served over HTTP.
    It was developed by Carie Peterson for her MS Plan B project.
    Presentation Slides (html), Report (html)

  • Features:
    An input configuration file holds all site-specific settings, so the crawler code does not need to be changed for each new site;
    XPath syntax is used to navigate to desired nodes;
    Perl expressions provide filtering capabilities on result nodes;
    The number of retrievals is limited only by the available computer resources.
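
    As an illustration of the filtering feature above, here is a minimal sketch of how a Perl expression taken from a configuration file might be applied to result nodes. It is not the crawler's actual code: the helper compile_filter, the sample filter expression, and the list of node texts (standing in for nodes selected by an XPath expression) are all assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a filter expression (as it might appear in a
# configuration file) into a code reference. The expression sees the node
# text in $_ and returns a true value to keep the node.
sub compile_filter {
    my ($expr) = @_;
    my $code = eval "sub { local \$_ = shift; $expr }";
    die "Bad filter expression '$expr': $@" unless $code;
    return $code;
}

# Stand-in for the text of nodes selected by an XPath expression.
my @node_text = (
    'Query Optimization in Practice',
    'Call for Papers',
    'Indexing XML Data',
);

# Illustrative filter: drop nodes that look like announcements.
my $filter = compile_filter('!/call for papers/i');

my @kept;
for my $text (@node_text) {
    push @kept, $text if $filter->($text);
}

print "$_\n" for @kept;
```

    Because the expression is evaluated as Perl code, a configuration file used this way should come only from a trusted source.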


  • Installation:
    Download the tar file that contains the crawler code and tools from this site.
    Make sure that the libwww-perl (LWP) and XML::XPath Perl modules are installed and available in the Perl environment. (Check the use statements in the code for a complete list of the required modules.)
    Make sure the tinytreenode.pm module has been retrieved from the tar file and is in the directory with the crawler code.
    Download tidy.exe. (If you are using Unix, you will need to adapt the crawler to run the Unix version of tidy instead.)
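
    The module check in the steps above can be scripted. The sketch below reports which modules cannot be loaded. The helper missing_modules is hypothetical, and the module names in @required are assumptions; take the real list from the use statements in the crawler code.

```perl
use strict;
use warnings;

# Hypothetical helper: return the names of the modules that cannot be loaded.
sub missing_modules {
    my @names = @_;
    my @missing;
    for my $module (@names) {
        # Convert Some::Module to Some/Module.pm and try to load it.
        (my $file = "$module.pm") =~ s{::}{/}g;
        push @missing, $module unless eval { require $file; 1 };
    }
    return @missing;
}

# Assumed module names; replace with the list from the crawler's source.
my @required = qw(LWP::UserAgent XML::XPath);
my @missing  = missing_modules(@required);

print @missing
    ? "Missing modules: @missing\n"
    : "All required modules are installed.\n";
```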


  • Further Work:
    1. Run crawler on sites other than VLDB and CIKM.
    2. Tie the crawler into an XPath visualization tool so that users can test XPath expressions and see the results visually (e.g. the nodes selected by an expression are highlighted).
    3. Develop a 'learning' component for the XPath visualizer that lets users choose the nodes they want graphically (click and drag) and then generates the XPath expression that selects those nodes. Extend this functionality to treat the input as training data for a learner component.
    4. Add a 'workflow' component that allows the crawler to handle interactive sites (e.g. login pages and searches).
    5. Improve the efficiency of the crawler by processing all the steps in a transform that use a particular input at the same time. Currently, each step in a transform performs its own retrieval, regardless of whether that page was already retrieved by another step.
    6. Provide a user interface to the retrieved information, allowing users to launch into the pages from which it was taken.
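
    For item 5, one simple way to avoid repeated retrievals is to cache pages by URL. This is a different technique from grouping the steps as proposed above, and it is only a sketch: fetch_page is a stand-in that counts calls instead of performing a real HTTP request, and all names and URLs are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical cache: URL => page content.
my %page_cache;
my $fetch_count = 0;

# Stand-in for a real HTTP retrieval; it only counts how often it is called.
sub fetch_page {
    my ($url) = @_;
    $fetch_count++;
    return "<html>content of $url</html>";
}

# Return the cached page if there is one; otherwise fetch and remember it.
sub get_page {
    my ($url) = @_;
    $page_cache{$url} = fetch_page($url) unless exists $page_cache{$url};
    return $page_cache{$url};
}

# Three steps, two of which use the same input page.
my @step_inputs = (
    'http://example.org/a',
    'http://example.org/a',
    'http://example.org/b',
);
get_page($_) for @step_inputs;

print "Steps: ", scalar(@step_inputs), ", retrievals: $fetch_count\n";
```

    With the inputs shown, the three steps trigger only two retrievals.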