Configuration files use a simple format consisting of either name/value pairs or complex variables in sections. Name/value pairs are encoded as single lines formatted like 'name = value'. Complex variables are encoded as multiple lines in named sections delimited as in XML, using '<name> ... </name>'. Sections may be nested to group related configuration variables. Empty lines and lines starting with '#' (comments) are ignored.
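A small fragment illustrating both styles might look as follows (the e-mail address and the regular expression inside the section are placeholders, not actual defaults):

```
# A simple name/value pair
Operator-Email = webmaster@example.org

# A complex variable in nested, XML-like sections
<url>
<allow>
.*\.example\.org
</allow>
</url>
```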
The most important configuration variables are the complex variables <url><allow> (allows certain URLs to be harvested) and <url><exclude> (excludes certain URLs from harvesting), which are used to limit your crawl to just a section of the WWW based on the URL. When URLs to be crawled are loaded into the system, each URL is first checked against the Perl regular expressions of <url><allow>; if it matches, it is then matched against <url><exclude>, where it is discarded on a match and otherwise scheduled for crawling. (See section 4.3 'URL filtering'.)
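The two-stage filter described above can be sketched in Python (the pattern values are invented examples; in Combine they are the Perl regular expressions taken from the <url><allow> and <url><exclude> sections of the configuration):

```python
import re

# Hypothetical stand-ins for the patterns read from the configuration.
ALLOW = [re.compile(r"https?://.*\.example\.org/")]
EXCLUDE = [re.compile(r"\.pdf$")]

def should_crawl(url):
    """Return True if the URL passes the allow/exclude filter."""
    # Step 1: the URL must match at least one allow pattern.
    if not any(p.search(url) for p in ALLOW):
        return False
    # Step 2: it is discarded if it matches any exclude pattern.
    if any(p.search(url) for p in EXCLUDE):
        return False
    # Otherwise it is scheduled for crawling.
    return True
```

Note that the allow check runs first: a URL that matches no allow pattern is dropped without ever being tested against the exclude patterns.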
All configuration files are stored in the /etc/combine/ directory tree. All configuration variables have reasonable defaults (section 9).
The values in the files directly under /etc/combine/ are used as global parameters for all crawler jobs.
The program combineINIT creates a job-specific sub-directory in /etc/combine and populates it with several files, including combine.cfg, which is initialized with a copy of job_default.cfg. You should always change the value of the variable Operator-Email in this file to something reasonable; Combine uses it to identify you to the crawled Web servers.
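For example, the relevant line in the job's combine.cfg could be edited to read (the address shown is of course just a placeholder for your own):

```
Operator-Email = webmaster@example.org
```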
The job name has to be given to all programs when they are started, using the --jobname switch.
Further details are found in section 9 ’Configuration variables’ which lists all variables and their default values.