System components

All executables take a mandatory switch, -jobname, which identifies both the crawl job you want to work with and its job-specific configuration directory.

Briefly, combineINIT initializes the SQL database and the job-specific configuration directory. combineCtrl controls a Combine crawling job (start, stop, etc.) and prints some statistics. combineExport exports records in various XML formats, and combineUtil provides various utility operations on the Combine database.
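As a rough illustration of this convention, the following Python sketch builds an argument list for any of the executables and refuses to continue without a job name. The helper combine_argv and the job name myjob are hypothetical and exist only for this example; the only detail taken from the text is the -jobname switch.

```python
from typing import List


def combine_argv(executable: str, jobname: str, *extra: str) -> List[str]:
    """Build an argv list for a Combine executable (hypothetical helper).

    Only the -jobname switch is taken from the documentation; any extra
    arguments are passed through unchanged and are the caller's concern.
    """
    if not jobname:
        # Every executable requires a job name, so fail early without one.
        raise ValueError("-jobname is mandatory for all Combine executables")
    return [executable, "-jobname", jobname, *extra]


if __name__ == "__main__":
    # "myjob" is a made-up job name used purely for illustration.
    print(combine_argv("combineCtrl", "myjob"))
```

Keeping the job name as a required argument of the helper mirrors the rule that no executable runs without it.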

Detailed dependency information can be found in the 'Gory details' section.

combineINIT

Creates a MySQL database and its tables, and initializes them. If the database already exists, it is dropped and recreated. A job-specific configuration directory is created in /etc/combine/ and populated with a default configuration file.

If a topic definition filename is given, focused crawling using this topic definition is enabled by default. Otherwise focused crawling is disabled, and Combine works as a general crawler.

combineCtrl

Implements the control functions needed to administer a crawling job, such as starting and stopping crawlers, injecting URLs into the crawl queue, scheduling newly found links for crawling, and controlling scheduling.

This is the preferred way of controlling a crawl job.

combineUtil

Generates various statistics and performs sanity checks on the database.

combineExport

Records are exported according to one of three profiles: alvis, dc, or combine. The alvis and combine profiles are very similar XML formats; combine is more compact, with less redundancy, while alvis carries somewhat more information. The dc profile is XML-encoded Dublin Core data.

The alvis profile format is defined by the Alvis Enriched Document XML Schema.

For convenience, the -xsltscript switch makes it possible to filter the output through an XSLT script. The script is fed a record in the combine profile, and the result is exported.
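The choices described in this section can be sketched as a small Python helper that assembles an export command line. This is only an illustration under stated assumptions: the helper export_argv, the job name myjob, the script path filter.xsl, and the switch name -profile are not taken from this section, which only names the three profiles and the -xsltscript switch.

```python
from typing import List, Optional

# The three export profiles named in the text.
PROFILES = ("alvis", "dc", "combine")


def export_argv(jobname: str, profile: str,
                xsltscript: Optional[str] = None) -> List[str]:
    """Assemble a combineExport command line (hypothetical helper).

    ASSUMPTION: the switch name "-profile" is illustrative; this section
    does not state how a profile is selected on the command line.
    """
    if profile not in PROFILES:
        raise ValueError(f"unknown profile {profile!r}; expected one of {PROFILES}")
    argv = ["combineExport", "-jobname", jobname, "-profile", profile]
    if xsltscript is not None:
        # -xsltscript is documented: the output is filtered through this script.
        argv += ["-xsltscript", xsltscript]
    return argv


if __name__ == "__main__":
    # "myjob" and "filter.xsl" are made-up values for illustration.
    print(export_argv("myjob", "dc"))
    print(export_argv("myjob", "combine", xsltscript="filter.xsl"))
```

Validating the profile name before building the command catches a mistyped profile before any export is attempted.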

Internal executables and Library modules

combine is the main crawling machine in the Combine system, and combineRun starts, monitors, and restarts combine crawling processes.
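The start, monitor, and restart cycle can be pictured with a generic supervisor loop. The Python sketch below is not how combineRun is implemented; it only illustrates the general pattern. The function supervise, its parameters, and the stand-in child process are all hypothetical.

```python
import subprocess
import sys
import time
from typing import List


def supervise(argv: List[str], max_restarts: int = 3, delay: float = 0.0) -> int:
    """Run argv, wait for it to exit, and restart it after a failure.

    Returns the number of restarts performed. This is a generic pattern,
    not a description of combineRun's actual behaviour.
    """
    restarts = 0
    while True:
        proc = subprocess.Popen(argv)  # start the child process
        code = proc.wait()             # monitor: block until it exits
        if code == 0 or restarts >= max_restarts:
            # Clean exit, or the restart budget is used up: stop supervising.
            return restarts
        restarts += 1
        time.sleep(delay)              # optional pause before restarting


if __name__ == "__main__":
    # Stand-in child that always fails, so the restart path is exercised.
    failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(supervise(failing, max_restarts=2))
```

Capping the number of restarts keeps a permanently failing process from being relaunched forever.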

Library

The main crawler-specific library components are collected in the Combine:: Perl namespace.

root 2006-11-29