Configuration
All configuration files are stored in the /etc/combine/
directory tree. All configuration variables have reasonable defaults (section 9).
- job_default.cfg
- contains the job-specific defaults. combineINIT copies it to a
subdirectory named after the job.
- SQLstruct.sql
- defines the structure of the internal SQL database used both for administration
and to hold data records.
- Topic_*
- contains various contributed topic definitions.
- Global configuration files
- These files hold global
parameters that apply to all crawler jobs.
- default.cfg
- contains the global defaults. It is loaded first.
Consult 'Configuration Variables'
and 'Default configuration files' for details.
Values can be overridden in
the job-specific configuration file combine.cfg.
- tidy.cfg
- configuration for Tidy cleaning of HTML code
- Files in job specific sub-directories
- The program combineINIT creates
a job-specific subdirectory in /etc/combine and populates it with some files, including combine.cfg,
which is initialized with a copy of job_default.cfg.
The job name has to be given to all programs
when they are started, using the
--jobname
switch.
- combine.cfg
- the job-specific configuration. It is loaded second
and overrides the global defaults. Consult 'Configuration Variables'
and 'Default configuration files' for details.
- topicdefinition.txt
- contains the topic definition for
a focused crawl if the
--topic
switch is given to combineINIT.
The format of this file is described in 'Topic definition'.
- stopwords.txt
- a file with words to be excluded from the automatic topic
classification processing. One word per line. The file may be empty but must be present.
- config_exclude
- contains more exclude patterns.
Optional, automatically included by combine.cfg. Updated by combineUtil.
- config_serveralias
- contains patterns for resolving Web server aliases.
Optional, automatically included by combine.cfg. Updated by combineUtil.
- sitesOK.txt
- optionally used by the
built-in automated classification algorithms to bypass
the topic filter for certain sites.
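The load order described in the list above (global default.cfg first, then the job's combine.cfg overriding it) can be sketched in Python as follows. This is only an illustration, not Combine's own code: the function names are hypothetical, settings are modelled as flat dictionaries, and how Combine merges nested sections is not covered here.

```python
from pathlib import PurePosixPath

# Root of the configuration tree, as stated in this section.
CONFIG_ROOT = PurePosixPath("/etc/combine")


def config_load_order(jobname):
    """Return the two configuration files in the order they are loaded.

    Hypothetical helper: global defaults first, job-specific file second.
    """
    return [CONFIG_ROOT / "default.cfg", CONFIG_ROOT / jobname / "combine.cfg"]


def merge_settings(global_defaults, job_settings):
    """Return the effective settings; job-specific values win.

    Assumption: a shallow merge of flat name/value pairs.
    """
    merged = dict(global_defaults)
    merged.update(job_settings)
    return merged


if __name__ == "__main__":
    for path in config_load_order("aatest"):
        print(path)
```

For a job named aatest, the second path is /etc/combine/aatest/combine.cfg, the file referred to later in this section.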
Configuration files use a simple format consisting of either name/value pairs
or complex variables in sections. Name/value pairs are encoded as single lines
formatted like 'name = value'. Complex variables are encoded as multiple
lines in named sections delimited as in XML, using '<name> ... </name>'.
Sections may be nested for related configuration variables.
Empty lines and lines starting with '#' (comments) are ignored.
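To make the format concrete, here is a minimal Python sketch of a reader for it. It is not the parser Combine uses; it only handles the rules stated above (name/value lines, nested named sections, empty lines and '#' comments). The function name parse_config and the section and key names in the sample text are hypothetical; only Operator-Email is a variable named in this section.

```python
import re

# Hypothetical sample text; section and key names are illustrative only.
SAMPLE_TEXT = """\
# A comment line
Operator-Email = crawler-admin@example.org

<outer-section>
  first-key = 1
  <inner-section>
    second-key = some value
  </inner-section>
</outer-section>
"""


def parse_config(text):
    """Parse name/value pairs and nested <name> ... </name> sections into dicts."""
    root = {}
    stack = [("", root)]  # (section name, dict); the innermost section is last
    for raw in text.splitlines():
        line = raw.strip()  # leading whitespace is tolerated in this sketch
        if not line or line.startswith("#"):
            continue  # empty lines and comments are ignored
        closing = re.fullmatch(r"</([^<>/\s]+)>", line)
        if closing:
            if len(stack) == 1 or stack[-1][0] != closing.group(1):
                raise ValueError(f"unexpected closing tag: {line}")
            stack.pop()
            continue
        opening = re.fullmatch(r"<([^<>/\s]+)>", line)
        if opening:
            section = {}
            stack[-1][1][opening.group(1)] = section
            stack.append((opening.group(1), section))
            continue
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected 'name = value': {line}")
        stack[-1][1][name.strip()] = value.strip()
    if len(stack) != 1:
        raise ValueError(f"unclosed section: {stack[-1][0]}")
    return root


if __name__ == "__main__":
    print(parse_config(SAMPLE_TEXT))
```

Sections map to nested dictionaries, so a variable inside nested sections is reached by indexing one level per section name.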
The most important configuration variables are the complex variables
<url><allow> (allows certain URLs to be harvested) and <url><exclude> (excludes certain URLs from harvesting), which are used
to limit your crawl to just a section of the Web, based on the URL.
When URLs are loaded into the system for crawling, each URL is first checked
against the Perl regular expressions of <url><allow>. If it
matches, it is then matched against <url><exclude>: it is
discarded if it matches, and scheduled for crawling otherwise.
(See 'URL filtering').
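The two-step check can be sketched in Python as below. This is an illustration of the order of the tests only: it uses Python's re module rather than Perl regular expressions, it matches each pattern against the whole URL, and it assumes that a URL matching no allow pattern is not scheduled. The function name and the example patterns are hypothetical and are not taken from a Combine configuration.

```python
import re


def should_schedule(url, allow_patterns, exclude_patterns):
    """Return True if the URL passes the allow test and then the exclude test."""
    # Step 1: the URL must match at least one allow pattern.
    if not any(re.search(pattern, url) for pattern in allow_patterns):
        return False
    # Step 2: a URL that matches any exclude pattern is discarded.
    if any(re.search(pattern, url) for pattern in exclude_patterns):
        return False
    return True


if __name__ == "__main__":
    # Hypothetical patterns: stay on one host, skip PDF files.
    allow = [r"^https?://www\.example\.org/"]
    exclude = [r"\.pdf$"]
    for candidate in (
        "http://www.example.org/docs/index.html",
        "http://www.example.org/docs/paper.pdf",
        "http://other.example.com/",
    ):
        print(candidate, should_schedule(candidate, allow, exclude))
```

Because the exclude test runs only after the allow test has passed, exclude patterns narrow the allowed set; they cannot admit a URL that the allow patterns reject.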
You should always change the value of the variable Operator-Email
in the file
/etc/combine/aatest/combine.cfg and set it to something
reasonable. Combine uses it to identify you to the crawled Web servers.
Further details are found in
'Configuration variables'.
root
2006-11-29