All of these commands take the switch --jobname, which identifies the particular crawl job you want to operate on as well as the job-specific configuration directory.
Briefly: combineINIT initializes the SQL database and the job-specific configuration directory; combineCtrl controls a Combine crawling job (start, stop, etc.) and prints some statistics; combineExport exports records in various XML formats; and combineUtil provides various utility operations on the Combine database.
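A typical job lifecycle might look like the following sketch. The job name atest and the output filename are placeholders, and the exact action names should be verified against the man pages in the appendix:

  combineINIT --jobname atest                   # create database and configuration
  combineCtrl start --jobname atest             # start the crawlers
  combineExport --jobname atest > records.xml   # export crawled records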
Detailed dependency information can be found in the 'Gory details' section.
All man pages are collected in the appendix.
If a topic definition filename is given, focused crawling using this topic definition is enabled by default. Otherwise focused crawling is disabled and Combine works as a general crawler.
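For instance, a focused job could be initialized as in the sketch below; the --topic switch name and the path to the topic definition file are assumptions to be checked against the combineINIT man page:

  combineINIT --jobname atest --topic /etc/combine/Topic_test.txt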
combineCtrl implements various control functionality for administering a crawling job, such as starting and stopping crawlers, injecting URLs into the crawl queue, scheduling newly found links for crawling, and otherwise controlling scheduling. This is the preferred way of controlling a crawl job.
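A few typical invocations, sketched under the assumption that load, start, recyclelinks, and kill are among the available actions (see the combineCtrl man page); the seed URL file is hypothetical:

  combineCtrl load --jobname atest < seed_urls.txt   # inject URLs into the crawl queue
  combineCtrl start --jobname atest                  # start crawler processes
  combineCtrl recyclelinks --jobname atest           # schedule newly found links for crawling
  combineCtrl kill --jobname atest                   # stop the crawlers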
The alvis profile format is defined by the Alvis Enriched Document XML Schema.
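Records can be exported in this format by selecting the alvis profile, for example (job name is a placeholder):

  combineExport --jobname atest --profile alvis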
For flexibility, the switch --xsltscript makes it possible to filter the output through an XSLT script. The script is fed a record in the combine profile format, and the result is exported.
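For example, with a hypothetical script path:

  combineExport --jobname atest --xsltscript /path/to/filter.xsl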
The switches --pipehost and --pipeport make combineExport send its output directly to an Alvis pipeline reader instead of printing it on stdout. Together with the switch --incremental, which exports only the changes since the last invocation, this provides an easy way of keeping an external system, such as Alvis or a Zebra database, up to date.
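A sketch of an incremental export to a pipeline reader, with hypothetical host and port values:

  combineExport --jobname atest --incremental --pipehost alvis.example.org --pipeport 3333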
The main crawler-specific library components are collected in the Combine:: Perl namespace.