Configuration variables

Name/value configuration variables


AutoRecycleLinks

Default value
= 1
Description:
Enable(1)/disable(0) automatic recycling of new links
Used by:
SD_SQL.pm


baseConfigDir

Default value
= /etc/combine
Description:
Base directory for configuration files; initialized by Config.pm
Used by:
FromHTML.pm; combineExport
Set by:
Config.pm


classifyPlugIn

Default value
= Combine::Check_record
Description:
Which topic classification plug-in module (algorithm) to use
Combine::Check_record and Combine::PosCheck_record are included by default
See classifyPlugInTemplate.pm and the documentation to write your own
Used by:
combine


configDir

Default value
= NoDefaultValue
Description:
Directory for job specific configuration files; taken from 'jobname'
Used by:
Check_record.pm; combineUtil; PosCheck_record.pm
Set by:
Config.pm


doAnalyse

Default value
= 1
Description:
Enable(1)/disable(0) analysis of genre, language
Used by:
combine


doCheckRecord

Description:
Enable(1)/disable(0) topic classification (focused crawling)
Generated by combineINIT based on -topic parameter
Used by:
combine


doOAI

Default value
= 1
Description:
Use(1)/do not use(0) OAI record status keeping in SQL database
Used by:
MySQLhdb.pm


extractLinksFromText

Default value
= 1
Description:
Extract(1)/do not extract(0) links from plain text
Used by:
combine


HarvesterMaxMissions

Default value
= 500
Description:
Number of pages to process before restarting the harvester
Used by:
combine


HarvestRetries

Default value
= 5
Used by:
combine


httpProxy

Default value
= NoDefaultValue
Description:
Use a proxy server if this is defined (default no proxy)
Used by:
UA.pm


LogHandle

Used by:
Check_record.pm; FromHTML.pm; PosCheck_record.pm
Set by:
combine


Loglev

Description:
Logging level (0 (least) - 10 (most))
Used by:
combine


maxUrlLength

Default value
= 250
Description:
Maximum length of a URL; longer URLs are silently discarded
Used by:
selurl.pm


MySQLdatabase

Default value
= NoDefaultValue
Description:
Identifies MySQL database name, user and host
Used by:
Config.pm


MySQLhandle

Used by:
combineUtil; LogSQL.pm; combine; RobotRules.pm; combineExport; SD_SQL.pm; combineRank; XWI2XML.pm; MySQLhdb.pm
Set by:
Config.pm


Operator-Email

Default value
= "YourEmailAdress@YourDomain"
Description:
Please change this to your own e-mail address
Used by:
RobotRules.pm; UA.pm


Password

Default value
= "XxXxyYzZ"
Description:
Password not used yet. (Please change)


saveHTML

Default value
= 1
Description:
Store(1)/do not store(0) the raw HTML in the database
Used by:
MySQLhdb.pm


SdqRetries

Default value
= 5


SummaryLength

Description:
How long the summary should be. Use 0 to disable the summarization code
Used by:
FromHTML.pm


UAtimeout

Default value
= 30
Description:
Time in seconds to wait for a server to respond
Used by:
UA.pm


UserAgentFollowRedirects

Description:
User agent handles redirects (1) or treats redirects as new links (0)
Used by:
UA.pm


UserAgentGetIfModifiedSince

Default value
= 1
Description:
If the page has been seen before, use a conditional GET with If-Modified-Since (1) or not (0)
Used by:
UA.pm


useTidy

Default value
= 1
Description:
Use(1)/do not use(0) Tidy to clean the HTML before parsing it
Used by:
FromHTML.pm


WaitIntervalExpirationGuaranteed

Default value
= 315360000
Used by:
UA.pm


WaitIntervalHarvesterLockNotFound

Default value
= 2592000
Used by:
combine


WaitIntervalHarvesterLockNotModified

Default value
= 2592000
Used by:
combine


WaitIntervalHarvesterLockRobotRules

Default value
= 2592000
Used by:
combine


WaitIntervalHarvesterLockSuccess

Default value
= 1000000
Description:
Time in seconds after a successful download before a page may be downloaded again (around 11.6 days)
Used by:
combine


WaitIntervalHarvesterLockUnavailable

Default value
= 86400
Used by:
combine


WaitIntervalHost

Default value
= 60
Description:
Minimum time between accesses to the same host. Must be positive
Used by:
SD_SQL.pm


WaitIntervalRrdLockDefault

Default value
= 86400
Used by:
RobotRules.pm


WaitIntervalRrdLockNotFound

Default value
= 345600
Used by:
RobotRules.pm


WaitIntervalRrdLockSuccess

Default value
= 345600
Used by:
RobotRules.pm


WaitIntervalSchedulerGetJcf

Default value
= 20
Description:
Time in seconds to wait before rescheduling again if a reschedule results in an empty ready queue
Used by:
combine
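
The larger WaitInterval defaults are hard to read at a glance. The following Python sketch converts the defaults listed above to days. It assumes that all of these values are in seconds, which the descriptions state only for some of them; the dictionary and function names are illustrative and not part of Combine.

```python
# Defaults copied from the entries above.
# Assumption: every value is a number of seconds.
WAIT_INTERVALS = {
    "WaitIntervalExpirationGuaranteed": 315360000,
    "WaitIntervalHarvesterLockNotFound": 2592000,
    "WaitIntervalHarvesterLockNotModified": 2592000,
    "WaitIntervalHarvesterLockRobotRules": 2592000,
    "WaitIntervalHarvesterLockSuccess": 1000000,
    "WaitIntervalHarvesterLockUnavailable": 86400,
    "WaitIntervalRrdLockDefault": 86400,
    "WaitIntervalRrdLockNotFound": 345600,
    "WaitIntervalRrdLockSuccess": 345600,
}

SECONDS_PER_DAY = 24 * 60 * 60


def seconds_to_days(seconds):
    """Convert a duration in seconds to (fractional) days."""
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    for name, seconds in WAIT_INTERVALS.items():
        print(f"{name}: {seconds} s = {seconds_to_days(seconds):.2f} days")
```

Under that assumption, 2592000 corresponds to 30 days, 345600 to 4 days, and 315360000 to 3650 days.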


ZebraHost

Default value
= NoDefaultValue
Description:
Direct connection to Zebra indexing - for SearchEngine-in-a-box (default no connection)
Used by:
MySQLhdb.pm

Complex configuration variables


allow

Description:
Use either URL or HOST: (note the ':') to match regular expressions against
either the full URL or the host part of a URL.
Allows crawling of URLs or hostnames that match these regular expressions
Used by:
selurl.pm


binext

Description:
Extensions of binary files
Used by:
UA.pm


converters

Description:
Configures which converters can be used to produce an XWI object
Format:
One line per entry
Each entry consists of 3 ';'-separated fields:
mime-type ; External converter command ; Internal converter
Entries are processed in order and the first match is executed
An external converter must be found via PATH and be executable to be considered a match
The external converter command should take a filename as its parameter and convert that file
The result should be written to STDOUT
Used by:
UA.pm; combine
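
The selection rule described above (three ';'-separated fields, entries tried in order, external command required on PATH) can be sketched in Python. This is not Combine's own code, which is written in Perl; the function names, the treatment of '#' lines as comments, and the handling of an empty external-command field (taken to mean "internal converter only") are assumptions made for illustration.

```python
import shutil


def parse_converters(text):
    """Parse 'mime-type ; external command ; internal converter' lines."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        # Assumption: blank lines and '#' comment lines are ignored.
        if not line or line.startswith("#"):
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) != 3:
            continue  # skip malformed entries
        entries.append(tuple(fields))
    return entries


def select_converter(entries, mime_type, which=shutil.which):
    """Return (external, internal) for the first matching entry, else None."""
    for entry_type, external, internal in entries:
        if entry_type != mime_type:
            continue
        if external:
            # The first word of the command is the program to look up on PATH.
            program = external.split()[0]
            if which(program) is None:
                continue  # not found or not executable: not a match
        return external, internal
    return None
```

The `which` parameter defaults to `shutil.which`, which searches PATH for an executable file; passing a stub makes the selection logic testable without installing any converter.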


exclude

Description:
Exclude URLs or hostnames that match these regular expressions
default: CGI and maps
default: binary files
default: Unparsable documents
default: images
default: other binary formats
more excludes in the file config_exclude (automatically updated by other programs)
Used by:
selurl.pm
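
How allow and exclude patterns might interact can be sketched in Python. The sketch assumes each pattern line has the form "HOST: regex" or "URL regex", and that a URL is crawled when it matches at least one allow pattern and no exclude pattern; neither the exact line syntax nor the precedence is spelled out above, and the actual logic lives in selurl.pm. All function names are hypothetical.

```python
import re
from urllib.parse import urlsplit


def parse_rules(lines):
    """Turn 'HOST: regex' / 'URL regex' lines into (scope, pattern) pairs."""
    rules = []
    for line in lines:
        line = line.strip()
        if line.startswith("HOST:"):
            rules.append(("host", re.compile(line[len("HOST:"):].strip())))
        elif line.startswith("URL"):
            rules.append(("url", re.compile(line[len("URL"):].strip())))
    return rules


def matches_any(rules, url):
    """True if any rule matches the full URL or its host part."""
    host = urlsplit(url).hostname or ""
    for scope, pattern in rules:
        target = host if scope == "host" else url
        if pattern.search(target):
            return True
    return False


def is_crawlable(url, allow_rules, exclude_rules):
    # Assumption: exclude patterns take precedence over allow patterns.
    return matches_any(allow_rules, url) and not matches_any(exclude_rules, url)
```

For example, with the allow rule `HOST: \.example\.org$` and the exclude rule `URL /cgi-bin/`, the URL `http://www.example.org/index.html` passes while `http://www.example.org/cgi-bin/search` is rejected.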


serveralias

Description:
The list of server names that are aliases is kept in the file ./config_serveralias
(automatically updated by other programs)
Use one server per line
Example:
www.100topwetland.com www.100wetland.com
means that www.100wetland.com is replaced by www.100topwetland.com during URL normalization
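
The alias replacement in the example above can be sketched in Python. The sketch assumes that the first name on a line is the canonical server and that every following name on the same line is an alias for it; the function names are hypothetical, and Combine's own normalization may handle more cases (for instance, user information in the URL is dropped here).

```python
from urllib.parse import urlsplit, urlunsplit


def parse_aliases(lines):
    """Map each alias host name to the canonical name at the start of its line."""
    alias_to_canonical = {}
    for line in lines:
        names = line.split()
        if len(names) < 2:
            continue  # a line with a single name defines no alias
        canonical, *aliases = names
        for alias in aliases:
            alias_to_canonical[alias.lower()] = canonical.lower()
    return alias_to_canonical


def normalize_host(url, alias_to_canonical):
    """Replace an aliased host in url by its canonical name."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    canonical = alias_to_canonical.get(host)
    if canonical is None:
        return url  # host is not a known alias
    netloc = canonical if parts.port is None else f"{canonical}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

With the example line, `normalize_host("http://www.100wetland.com/a?b=1", aliases)` yields `http://www.100topwetland.com/a?b=1`.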


sessionids

Description:
Patterns used to recognize and remove session IDs in URLs
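
As an illustration of what such patterns do, the Python sketch below strips a few common session-ID forms from a URL. The parameter names (jsessionid, phpsessid, sid) and the regular expressions are examples chosen for this sketch, not the patterns shipped with Combine.

```python
import re

# Hypothetical patterns, for illustration only.
PATH_SESSION = re.compile(r"(?i);jsessionid=[0-9a-z]+")
QUERY_SESSION = re.compile(r"(?i)([?&])(?:phpsessid|sid)=[0-9a-z]+&?")
DANGLING_SEPARATOR = re.compile(r"[?&]$")


def strip_session_ids(url):
    """Remove session-ID path and query parameters from url."""
    # Path parameter form: /page;jsessionid=ABC123
    url = PATH_SESSION.sub("", url)
    # Query parameter form: keep the leading '?' or '&', drop the parameter.
    url = QUERY_SESSION.sub(r"\1", url)
    # Drop a '?' or '&' left dangling at the end of the URL.
    return DANGLING_SEPARATOR.sub("", url)
```

Removing session IDs this way lets two URLs that differ only in their session parameter normalize to the same string, so the page is not harvested twice.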


url

Description:
url is just a container for all URL-related configuration patterns
Used by:
Config.pm; selurl.pm

root 2007-03-29