Configuration variables
AutoRecycleLinks
- Default value
- = 1
- Description:
- Enable(1)/disable(0) automatic recycling of new links
- Used by:
- SD_SQL.pm
baseConfigDir
- Default value
- = /etc/combine
- Description:
- Base directory for configuration files; initialized by Config.pm
- Used by:
- FromHTML.pm; combineExport
- Set by:
- Config.pm
classifyPlugIn
- Default value
- = Combine::Check_record
- Description:
- Which topic classification plug-in module (algorithm) to use.
Combine::Check_record and Combine::PosCheck_record are included by default;
see classifyPlugInTemplate.pm and the documentation to write your own
- Used by:
- combine
configDir
- Default value
- = NoDefaultValue
- Description:
- Directory for job-specific configuration files; taken from 'jobname'
- Used by:
- Check_record.pm; combineUtil; PosCheck_record.pm
- Set by:
- Config.pm
doAnalyse
- Default value
- = 1
- Description:
- Enable(1)/disable(0) analysis of genre, language
- Used by:
- combine
doCheckRecord
- Description:
- Enable(1)/disable(0) topic classification (focused crawling)
Generated by combineINIT based on the -topic parameter
- Used by:
- combine
doOAI
- Default value
- = 1
- Description:
- Use(1)/do not use(0) OAI record status keeping in SQL database
- Used by:
- MySQLhdb.pm
extractLinksFromText
- Default value
- = 1
- Description:
- Extract(1)/do not extract(0) links from plain text
- Used by:
- combine
HarvesterMaxMissions
- Default value
- = 500
- Description:
- Number of pages to process before restarting the harvester
- Used by:
- combine
HarvestRetries
- Default value
- = 5
- Used by:
- combine
httpProxy
- Default value
- = NoDefaultValue
- Description:
- Use a proxy server if this is defined (default no proxy)
- Used by:
- UA.pm
LogHandle
- Used by:
- Check_record.pm; FromHTML.pm; PosCheck_record.pm
- Set by:
- combine
Loglev
- Description:
- Logging level (0 (least) - 10 (most))
- Used by:
- combine
maxUrlLength
- Default value
- = 250
- Description:
- Maximum length of a URL; longer URLs are silently discarded
- Used by:
- selurl.pm
MySQLdatabase
- Default value
- = NoDefaultValue
- Description:
- Identifies MySQL database name, user and host
- Used by:
- Config.pm
MySQLhandle
- Used by:
- combineUtil; LogSQL.pm; combine; RobotRules.pm; combineExport; SD_SQL.pm; combineRank; XWI2XML.pm; MySQLhdb.pm
- Set by:
- Config.pm
Operator-Email
- Default value
- = "YourEmailAdress@YourDomain"
- Description:
- Email address of the crawler operator. Please change
- Used by:
- RobotRules.pm; UA.pm
Password
- Default value
- = "XxXxyYzZ"
- Description:
- Password not used yet. (Please change)
saveHTML
- Default value
- = 1
- Description:
- Store(1)/do not store(0) the raw HTML in the database
- Used by:
- MySQLhdb.pm
SdqRetries
- Default value
- = 5
SummaryLength
- Description:
- How long the summary should be. Use 0 to disable the summarization code
- Used by:
- FromHTML.pm
UAtimeout
- Default value
- = 30
- Description:
- Time in seconds to wait for a server to respond
- Used by:
- UA.pm
UserAgentFollowRedirects
- Description:
- User agent handles redirects (1) or treats redirects as new links (0)
- Used by:
- UA.pm
UserAgentGetIfModifiedSince
- Default value
- = 1
- Description:
- If we have seen this page before, use Get-If-Modified (1) or not (0)
- Used by:
- UA.pm
useTidy
- Default value
- = 1
- Description:
- Use(1)/do not use(0) Tidy to clean the HTML before parsing it
- Used by:
- FromHTML.pm
WaitIntervalExpirationGuaranteed
- Default value
- = 315360000
- Used by:
- UA.pm
WaitIntervalHarvesterLockNotFound
- Default value
- = 2592000
- Used by:
- combine
WaitIntervalHarvesterLockNotModified
- Default value
- = 2592000
- Used by:
- combine
WaitIntervalHarvesterLockRobotRules
- Default value
- = 2592000
- Used by:
- combine
WaitIntervalHarvesterLockSuccess
- Default value
- = 1000000
- Description:
- Time in seconds after a successful download before a page may be downloaded again (around 11.6 days)
- Used by:
- combine
WaitIntervalHarvesterLockUnavailable
- Default value
- = 86400
- Used by:
- combine
WaitIntervalHost
- Default value
- = 60
- Description:
- Minimum time between accesses to the same host. Must be positive
- Used by:
- SD_SQL.pm
WaitIntervalRrdLockDefault
- Default value
- = 86400
- Used by:
- RobotRules.pm
WaitIntervalRrdLockNotFound
- Default value
- = 345600
- Used by:
- RobotRules.pm
WaitIntervalRrdLockSuccess
- Default value
- = 345600
- Used by:
- RobotRules.pm
WaitIntervalSchedulerGetJcf
- Default value
- = 20
- Description:
- Time in seconds to wait before rescheduling again if a reschedule results in an empty ready queue
- Used by:
- combine
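Several of the WaitInterval entries above give only a raw number. A small sketch like the following converts the listed defaults to days so they are easier to compare. It assumes that every WaitInterval value is expressed in seconds, which the source states explicitly only for some of the entries; the helper name seconds_to_days is made up for this illustration.

```python
# Default values copied from the entries above.
# Assumption: all of them are in seconds (the source says so only for some).
DEFAULTS = {
    "WaitIntervalExpirationGuaranteed": 315360000,
    "WaitIntervalHarvesterLockNotFound": 2592000,
    "WaitIntervalHarvesterLockNotModified": 2592000,
    "WaitIntervalHarvesterLockRobotRules": 2592000,
    "WaitIntervalHarvesterLockSuccess": 1000000,
    "WaitIntervalHarvesterLockUnavailable": 86400,
    "WaitIntervalHost": 60,
    "WaitIntervalRrdLockDefault": 86400,
    "WaitIntervalRrdLockNotFound": 345600,
    "WaitIntervalRrdLockSuccess": 345600,
    "WaitIntervalSchedulerGetJcf": 20,
}

SECONDS_PER_DAY = 24 * 60 * 60


def seconds_to_days(seconds):
    """Convert a duration in seconds to (fractional) days."""
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    for name, seconds in DEFAULTS.items():
        print(f"{name}: {seconds} s = {seconds_to_days(seconds):.2f} days")
```

Under that assumption, 2592000 corresponds to 30 days, 345600 to 4 days, and 315360000 to 3650 days (roughly ten years).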
ZebraHost
- Default value
- = NoDefaultValue
- Description:
- Direct connection to Zebra indexing - for SearchEngine-in-a-box (default no connection)
- Used by:
- MySQLhdb.pm
allow
- Description:
- Use either URL or HOST: (note the ':') to match regular expressions against
either the full URL or the host part of a URL.
Allow crawling of URLs or hostnames that match these regular expressions
- Used by:
- selurl.pm
binext
- Description:
- Extensions of binary files
- Used by:
- UA.pm
converters
- Description:
- Configure which converters can be used to produce an XWI object
Format:
1 line per entry
each entry consists of 3 ';' separated fields
Entries are processed in order and the first match is executed
external converters must be found via PATH and be executable to be considered a match
the external converter command should take a filename as parameter and convert that file
the result should be written to STDOUT
mime-type ; External converter command ; Internal converter
- Used by:
- UA.pm; combine
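The entry format and the first-match rule described for converters can be illustrated with a short sketch. This is not the code Combine itself uses; it is a Python illustration under stated assumptions: the mime-type field is compared by exact string equality, an empty external-command field means that only the internal converter applies, and the tool and converter names in SAMPLE are hypothetical.

```python
import shutil

# Hypothetical entries in the documented form:
# mime-type ; External converter command ; Internal converter
SAMPLE = """
application/pdf ; missing-tool --flag ; InternalA
application/pdf ; present-tool ; InternalB
text/html ; ; InternalC
"""


def parse_converters(text):
    """Split each non-empty line into its three ';'-separated fields."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) != 3:
            raise ValueError(f"expected 3 ';'-separated fields: {line!r}")
        entries.append(tuple(fields))
    return entries


def select_converter(entries, mime_type, which=shutil.which):
    """Return (external, internal) of the first matching entry, else None.

    An entry with an external command only counts as a match if the
    program is found via PATH and is executable (checked with `which`).
    """
    for entry_mime, external, internal in entries:
        if entry_mime != mime_type:  # assumption: exact comparison
            continue
        if external and which(external.split()[0]) is None:
            continue  # external converter not available, try next entry
        return external, internal
    return None
```

Because entries are processed in order, listing a preferred external tool first and a fallback second gives a simple degradation path when the preferred tool is not installed.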
exclude
- Description:
- Exclude URLs or hostnames that match these regular expressions
default: CGI and maps
default: binary files
default: Unparsable documents
default: images
default: other binary formats
more excludes in the file config_exclude (automatically updated by other programs)
- Used by:
- selurl.pm
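How the allow and exclude patterns might interact can be sketched as follows. Several things here are assumptions rather than documented behaviour: that each pattern line starts with either `HOST:` or `URL` followed by the regular expression, that a URL is accepted when it matches at least one allow pattern and no exclude pattern, and that Python's `re` syntax is close enough to the Perl regular expressions the real configuration uses. The sample patterns are hypothetical.

```python
import re
from urllib.parse import urlsplit


def parse_rules(lines):
    """Turn 'HOST: <regex>' / 'URL <regex>' lines into (target, regex) pairs."""
    rules = []
    for line in lines:
        line = line.strip()
        if line.startswith("HOST:"):
            rules.append(("host", re.compile(line[len("HOST:"):].strip())))
        elif line.startswith("URL"):
            rules.append(("url", re.compile(line[len("URL"):].strip())))
    return rules


def matches_any(rules, url):
    """True if any rule matches the full URL or its host part, as requested."""
    host = urlsplit(url).hostname or ""
    return any(
        regex.search(host if target == "host" else url) is not None
        for target, regex in rules
    )


def is_allowed(url, allow, exclude):
    """Assumed policy: at least one allow match and no exclude match."""
    return matches_any(allow, url) and not matches_any(exclude, url)


# Hypothetical patterns for illustration only.
ALLOW = parse_rules([r"HOST: \.example\.org$"])
EXCLUDE = parse_rules([r"URL \.(gif|jpe?g|png)$", r"URL /cgi-bin/"])
```

With these sample rules, a page on www.example.org is accepted, while an image on the same host or any page on another host is rejected.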
serveralias
- Description:
- The list of server names that are aliases is kept in the file ./config_serveralias
(automatically updated by other programs)
use one server per line
example
www.100topwetland.com www.100wetland.com
means that www.100wetland.com is replaced by www.100topwetland.com during URL normalization
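The alias replacement in the example can be sketched like this. It is an illustration only, assuming that the first name on a line is the canonical server and every following name on that line is an alias for it; the function names are made up, and the sketch ignores any user information in the URL.

```python
from urllib.parse import urlsplit, urlunsplit


def parse_aliases(lines):
    """Map each alias to the canonical name that starts its line."""
    alias_to_canonical = {}
    for line in lines:
        names = line.split()
        if len(names) < 2:
            continue  # nothing to alias on this line
        canonical, *aliases = names
        for alias in aliases:
            alias_to_canonical[alias.lower()] = canonical.lower()
    return alias_to_canonical


def normalize_host(url, alias_to_canonical):
    """Replace an aliased host name in `url` with its canonical name."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    canonical = alias_to_canonical.get(host)
    if canonical is None:
        return url  # host is not a known alias, leave the URL untouched
    netloc = canonical if parts.port is None else f"{canonical}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


# The example line from the description above.
ALIASES = parse_aliases(["www.100topwetland.com www.100wetland.com"])
```

Calling `normalize_host("http://www.100wetland.com/page.html", ALIASES)` then yields the same URL on www.100topwetland.com, while URLs on other hosts pass through unchanged.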
sessionids
- Description:
- patterns to recognize and remove sessionids in URLs
url
- Description:
- url is just a container for all URL-related configuration patterns
- Used by:
- Config.pm; selurl.pm
root
2007-09-27