All software and packages are available from a number of places.
In addition to the distribution sites there is a public discussion list at SourceForge.
The system is distributed either as source or as a Debian package.
perl Makefile.PL
make
make test
make install
mkdir /etc/combine
cp conf/* /etc/combine/
mkdir /var/run/combine
Test that it all works (run as root):
./doc/InstallationTest.pl
Furthermore, the external Perl modules should be verified to work on the new platform. Perl modules are most easily installed using the Perl CPAN automated system (perl -MCPAN -e shell).
Optionally, these external programs will be used if they are installed on your system.
Download the latest distribution.
Install all software that Combine depends on (see above).
Unpack the archive with 'tar zxf'. This will create a directory named combine-XX with a number of subdirectories, including bin, Combine, doc, and conf.
'bin' contains the executable programs.
'Combine' contains the needed Perl modules. These should be copied to somewhere Perl will find them, typically /usr/share/perl5/Combine/.
'conf' contains the default configuration files. Combine looks for them in /etc/combine/, so they need to be copied there.
'doc' contains documentation.
The following command sequence will install Combine:
perl Makefile.PL
make
make test
make install
mkdir /etc/combine
cp conf/* /etc/combine/
mkdir /var/run/combine
sudo combineINIT --jobname aatest --topic /etc/combine/Topic_carnivor.txt
combine --jobname aatest --harvest http://combine.it.lth.se/CombineTests/InstallationTest.html
combineExport --jobname aatest --profile dc

and verify that the output, except for dates and order, looks like:
<?xml version="1.0" encoding="UTF-8"?>
<documentCollection version="1.1" xmlns:dc="http://purl.org/dc/elements/1.1/">
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:format>text/html</dc:format>
  <dc:format>text/html; charset=iso-8859-1</dc:format>
  <dc:subject>Carnivorous plants</dc:subject>
  <dc:subject>Drosera</dc:subject>
  <dc:subject>Nepenthes</dc:subject>
  <dc:title transl="yes">Installation test for Combine</dc:title>
  <dc:description></dc:description>
  <dc:date>2006-05-19 9:57:03</dc:date>
  <dc:identifier>http://combine.it.lth.se/CombineTests/InstallationTest.html</dc:identifier>
  <dc:language>en</dc:language>
</metadata>
Or run, as root, the script ./doc/InstallationTest.pl, which essentially does the same thing.
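Instead of eyeballing the XML, a rough automated check is to grep the export for fields that must be present. A minimal sketch, run here against a trimmed copy of the record above; in a real run you would pipe the output of combineExport into the same loop:

```shell
# Sanity-check that an exported record carries the expected Dublin Core
# fields (the sample is abbreviated from the record shown above).
sample='<dc:subject>Carnivorous plants</dc:subject>
<dc:identifier>http://combine.it.lth.se/CombineTests/InstallationTest.html</dc:identifier>
<dc:language>en</dc:language>'
for field in dc:subject dc:identifier dc:language; do
  printf '%s\n' "$sample" | grep -q "<$field>" && echo "$field present"
done
```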
The typical command sequence for the job is:

sudo combineINIT --jobname aatest
combineCtrl load --jobname aatest
combineCtrl start --jobname aatest --harvesters 2
combineCtrl kill --jobname aatest
combineExport --jobname aatest --profile alvis

The running crawl is managed by invoking 'combineCtrl <action> --jobname aatest' with various parameters.
Once a job is initialized, it is controlled using combineCtrl. Crawled data is exported using combineExport.
sudo combineINIT --jobname focustest
In the job's configuration, change

#use either URL or HOST: (obs ':') to match regular expressions to either the
#full URL or the HOST part of a URL.
<allow>
#Allow crawl of URLs or hostnames that matches these regular expressions
HOST: .*$
</allow>

to

#use either URL or HOST: (obs ':') to match regular expressions to either the
#full URL or the HOST part of a URL.
<allow>
#Allow crawl of URLs or hostnames that matches these regular expressions
HOST: www\.alvis\.info$
HOST: combine\.it\.lth\.se$
</allow>

The escaping of '.' by writing '\.' is necessary since the patterns actually are Perl regular expressions. Similarly, the ending '$' indicates that the host string should end here, so, for example, a Web server on www.alvis.info.com (if such a one exists) will not be crawled.
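The difference the anchor and the escaping make can be seen with a quick stand-alone test. grep -E is used here only for convenience; Combine itself evaluates these patterns as Perl regular expressions, which agree with grep -E for this pattern:

```shell
# www.alvis.info matches the anchored pattern, while
# www.alvis.info.com is rejected because of the trailing '$'.
for host in www.alvis.info www.alvis.info.com; do
  if printf '%s\n' "$host" | grep -qE 'www\.alvis\.info$'; then
    echo "$host: allowed"
  else
    echo "$host: rejected"
  fi
done
# www.alvis.info: allowed
# www.alvis.info.com: rejected
```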
The focused job is then run and exported the same way as above:

combineCtrl load --jobname focustest
combineCtrl start --jobname focustest
combineCtrl kill --jobname focustest
combineExport --jobname focustest --profile alvis
Create and maintain a topic specific crawled database for the topic 'Carnivorous plants'.
Seed URLs:

http://www.sarracenia.com/faq.html
http://dmoz.org/Home/Gardening/Plants/Carnivorous_Plants/
http://www.omnisterra.com/bot/cp_home.cgi
http://www.vcps.au.com/
http://www.murevarn.se/links.html
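For the load step of this example, the seed URLs can be put one per line into a file; a small sketch (cpSeedURLs.txt is the file name used later in this example):

```shell
# Write the seed URLs, one per line, to cpSeedURLs.txt.
cat > cpSeedURLs.txt <<'EOF'
http://www.sarracenia.com/faq.html
http://dmoz.org/Home/Gardening/Plants/Carnivorous_Plants/
http://www.omnisterra.com/bot/cp_home.cgi
http://www.vcps.au.com/
http://www.murevarn.se/links.html
EOF
wc -l < cpSeedURLs.txt   # number of seed URLs (5)
```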
sudo combineINIT --jobname cptest --topic cpTopic.txt
This enables topic checking and focused crawl mode by setting
configuration variable doCheckRecord = 1 and copying a topic definition file (cpTopic.txt) to
/etc/combine/cptest/topicdefinition.txt.
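If a job was initialized without --topic, the same state can be produced by hand. A hedged sketch, assuming the job configuration stores the variable as a plain 'doCheckRecord = 0' line; it is demonstrated on a temporary file rather than the real configuration under /etc/combine/:

```shell
# Flip doCheckRecord from 0 to 1 in a config file (shown on a temporary
# copy; point cfg at the real job configuration to apply it for real).
cfg=$(mktemp)
printf 'doCheckRecord = 0\n' > "$cfg"
sed -i 's/^doCheckRecord = .*/doCheckRecord = 1/' "$cfg"
cat "$cfg"   # doCheckRecord = 1
```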
combineCtrl load --jobname cptest < cpSeedURLs.txt
combineCtrl start --jobname cptest --harvesters 3
combineExport --jobname cptest --profile alvis
The job is then controlled as above with 'combineCtrl <action> --jobname cptest'.