August 2006
Crawl-By-Example
The Crawl-By-Example project is a plugin for the Heritrix crawler that improves the crawler's ability to find useful and interesting pages.
Ariel
Ariel is a library that allows you to extract information from semi-structured documents (such as websites). It uses a small number of labeled examples to generate and learn effective extraction rules.
RDig - Ferret based full text search for web sites
RDig provides an HTTP crawler and content extraction utilities to help build a site search for websites or intranets. Internally, Ferret is used for full-text indexing.
February 2006
Heritrix
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.