originally: http://www.oclc.org/research/software/webutils/default.htm
The Webutils Open Source project offers Perl utilities to support web harvesting and metadata extraction.
The Webutils code in the CVS repository is divided into modules for ease of retrieval.
The modules are listed below, each with links to its documentation and source. The Webutils code may also be downloaded for use or evaluation without using CVS.
WWW::Harvester | v 1.15 | Documentation | Source
This module provides an extensible mechanism for harvesting web pages, i.e., for use as a spider or robot.

HTML::Normalizer | v 1.04 | Documentation | Source
This module extracts and normalizes the text of an HTML page.

HTML::MetaExtor | v 1.08 | Documentation | Source
This module extracts metadata from the META elements of an HTML page. If supplied with a list of index terms, it also reports which of those terms appear in the page. (Note: MetaExtor depends on Normalizer.)
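
The kind of processing described for Normalizer and MetaExtor can be illustrated with a small sketch. It is not the Webutils code and does not use the modules' interfaces, which are not shown on this page; it is a Python stand-in built on the standard library. The names `MetaAndTextParser`, `normalize`, and `extract` are hypothetical, and the normalization rule (collapse whitespace, lowercase) is an assumption about what "normalizes" might reasonably mean.

```python
import re
from html.parser import HTMLParser


class MetaAndTextParser(HTMLParser):
    """Hypothetical helper: collects META name/content pairs and page text."""

    SKIPPED = {"script", "style"}  # element contents that are not page text

    def __init__(self):
        super().__init__()
        self.meta = {}        # lowercased META name -> list of content values
        self.chunks = []      # raw text fragments, in document order
        self._skip_depth = 0  # > 0 while inside a script or style element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "meta":
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            if name and content is not None:
                self.meta.setdefault(name.lower(), []).append(content)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def normalize(text):
    """Assumed normalization: collapse runs of whitespace and lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()


def extract(html, index_terms=()):
    """Return (meta, normalized_text, terms_found) for one HTML page."""
    parser = MetaAndTextParser()
    parser.feed(html)
    parser.close()
    text = normalize(" ".join(parser.chunks))
    found = []
    for term in index_terms:
        # Match the normalized term as a whole word or phrase in the text.
        pattern = r"(?<!\w)" + re.escape(normalize(term)) + r"(?!\w)"
        if re.search(pattern, text):
            found.append(term)
    return parser.meta, text, found
```

In this sketch the term matching runs against the normalized text rather than the raw markup, which mirrors the stated dependency of MetaExtor on Normalizer: a term such as "Web pages" is found regardless of case or of line breaks between the words. Terms that occur only inside META content are not reported as present in the page text.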