Open source distribution, installation
The focused crawler has been restructured and packaged as a Debian
package in order to ease distribution and installation. The package
contains dependency information to make sure that all software that is
needed to run the crawler is installed at the same time. In connection
with this we have also packaged a number of necessary Perl modules as
Debian packages.
All software and packages are available from a number of distribution sites.
In addition to the distribution sites there is a public
discussion list at SourceForge.
This distribution is developed and tested on Linux systems.
It is implemented entirely in Perl and uses the MySQL
database system, both of which are supported on many other
operating systems. Porting to other UNIX dialects should be easy.
The system is distributed either as source or as a Debian package.
Unless you are on a system supporting Debian packages (in which case see
Automated Debian/Ubuntu installation below), you should download and unpack
the source; the command sequence that installs Combine is described in the
manual installation steps below.
Test that it all works (run as root)
./doc/InstallationTest.pl
In order to port the system to another platform, you have to verify that
the two main systems it builds on, Perl and the MySQL database, are
available for that platform.
If they are supported you stand a good chance of porting the system.
Furthermore,
the external Perl modules should be verified to work
on the new platform.
Perl modules are most easily installed
using the Perl CPAN automated system
(perl -MCPAN -e shell).
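For example, a missing module can be installed non-interactively from the command line (the module name HTML::Parser is only an illustration, not a statement about Combine's exact dependency list):
perl -MCPAN -e 'install HTML::Parser'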
Optionally the following external programs will be used if they are
installed on your system:
- antiword (parsing MSWord files)
- detex (parsing TeX files)
- pdftohtml (parsing PDF files)
- pstotext (parsing PS and PDF files, needs Ghostscript)
- xlhtml (parsing MSExcel files)
- ppthtml (parsing MSPowerPoint files)
- unrtf (parsing RTF files)
- tth (parsing TeX files)
- untex (parsing TeX files)
Automated Debian/Ubuntu installation
- Add the following lines to your /etc/apt/sources.list:
deb http://combine.it.lth.se/ debian/
- Give the commands:
apt-get update
apt-get install combine
This also installs all dependencies such as MySQL and a lot of necessary
Perl modules.
Download the latest distribution.
Install all software that Combine depends on (see above).
Unpack the archive with tar zxf
This will create a directory named combine-XX with
a number of subdirectories including bin, Combine, doc, and conf.
'bin' contains the executable programs.
'Combine' contains needed Perl modules. They should be copied to
where Perl will find them, typically /usr/share/perl5/Combine/.
'conf' contains the default configuration files. Combine looks for them
in /etc/combine/ so they need to be copied there.
'doc' contains documentation.
The following command sequence will install Combine.
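A minimal sketch of such a sequence, run as root; the archive name and the target directory for the executables are assumptions based on the description above:
tar zxf combine-XX.tar.gz
cd combine-XX
cp -r Combine /usr/share/perl5/    # Perl modules end up in /usr/share/perl5/Combine/
mkdir -p /etc/combine
cp -r conf/* /etc/combine/         # default configuration files
cp bin/* /usr/local/bin/           # executable programs (target directory is an assumption)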
A simple way to test your newly installed Combine system is
to crawl just one Web page and export it as an XML document. This
exercises much of the code and verifies that basic focused crawling works.
- Initialize a crawl-job named aatest. This will create and populate
the job-specific configuration directory and create the MySQL database
that will hold the records:
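Following the Getting started work-flow below, run (as root):
sudo combineINIT --jobname aatest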
- Export a structured Dublin Core record by:
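A sketch of the export command; the alvis profile used later in this document is known, while the Dublin Core profile name dc is an assumption to be checked against the combineExport documentation:
combineExport --jobname aatest --profile dc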
- and verify that the output looks like a structured Dublin Core record for
the crawled page (exact dates and record order may differ).
Alternatively, run (as root) the script
./doc/InstallationTest.pl
which performs essentially the same test.
Getting started
A simple example work-flow for a trivial crawl job named 'aatest' might look like:
- Initialize database and configuration (needs root privileges)
sudo combineINIT --jobname aatest
- Load some seed URLs like (you can repeat this command with different URLs as many times as you wish)
echo 'http://combine.it.lth.se/' | combineCtrl load --jobname aatest
- Start 2 harvesting processes
combineCtrl start --jobname aatest --harvesters 2
- Let it run for some time. Status and progress can be checked using
the program 'combineCtrl --jobname aatest' with various parameters.
- When satisfied, kill the crawlers
combineCtrl kill --jobname aatest
- Export data records in the ALVIS XML format
combineExport --jobname aatest --profile alvis
- If you want to schedule a recheck of all the crawled pages stored in the database, do
combineCtrl reharvest --jobname aatest
- Go back to the step that starts the harvesting processes for continuous operation.
Once a job is initialized it is controlled using
combineCtrl. Crawled data is exported using combineExport.
The latest, updated, detailed documentation is always available
online.
Use the same procedure as in section 2.2. This way of
crawling is not recommended for the Combine system, since it will
generate very large databases without any focus.
Focused crawling - domain restrictions
Create a focused database with all pages from a Web-site. In this
use scenario we will crawl the Combine site and the ALVIS site.
The database is to be continuously updated, i.e. all pages have to be
regularly tested for changes, deleted pages should be removed from
the database, and newly created pages added.
- Initialize database and configuration
sudo combineINIT --jobname focustest
- Edit the configuration to provide the desired focus
Change the <allow> part in /etc/combine/focustest/combine.cfg so that only
the two target sites are allowed, as sketched below.
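A sketch of the new <allow> section, with host patterns matching the two sites in this scenario (the HOST: pattern syntax is assumed to follow the default configuration file shipped with Combine):
<allow>
HOST: combine\.it\.lth\.se$
HOST: www\.alvis\.info$
</allow>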
The escaping of '.' by writing '\.' is necessary since the patterns
actually are Perl regular expressions. Similarly the ending '$'
indicates that the host string should end here, so for example
a Web server on www.alvis.info.com (if such exists) will
not be crawled.
- Load seed URLs
echo 'http://combine.it.lth.se/' | combineCtrl load --jobname focustest
echo 'http://www.alvis.info/' | combineCtrl load --jobname focustest
- Start 1 harvesting process
combineCtrl start --jobname focustest
- Daily export all data records in the ALVIS XML format
combineExport --jobname focustest --profile alvis
and schedule all pages for re-harvesting
combineCtrl reharvest --jobname focustest
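Since this is a daily routine it can be scheduled with cron. A minimal sketch, assuming the Combine programs are on the PATH of the cron environment (the file name /etc/cron.d/combine-focustest is hypothetical):
# /etc/cron.d/combine-focustest: daily export at 03:00, recheck scheduling at 03:30
0 3 * * * root combineExport --jobname focustest --profile alvis
30 3 * * * root combineCtrl reharvest --jobname focustest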
Focused crawling - topic specific
Create and maintain a topic specific crawled database for the topic 'Carnivorous plants'.
- Create a topic definition (see section 4.5.1) in a local file named cpTopic.txt. (Can be done by copying /etc/combine/Topic_carnivor.txt since it happens to be just that.)
- Create a file named cpSeedURLs.txt with seed URLs for this
topic, one URL per line.
- Initialization
sudo combineINIT --jobname cptest --topic cpTopic.txt
This enables topic checking and focused crawl mode by setting
configuration variable doCheckRecord = 1 and copying a topic definition file (cpTopic.txt) to
/etc/combine/cptest/topicdefinition.txt.
- Load seed URLs
combineCtrl load --jobname cptest < cpSeedURLs.txt
- Start 3 harvesting processes
combineCtrl start --jobname cptest --harvesters 3
- Regularly export all data records in the ALVIS XML format
combineExport --jobname cptest --profile alvis
Running this crawler for an extended period will result in more than
200 000 records.
Use the same procedure as in the section Focused crawling - topic specific,
except for the last point. Exporting should be done incrementally into an Alvis
pipeline (in this example listening on port 3333 on the machine nlp.alvis.info),
as sketched below.
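A sketch of such an export command; the option names --incremental, --pipehost and --pipeport are assumptions and should be checked against the combineExport documentation:
combineExport --jobname cptest --profile alvis --incremental --pipehost nlp.alvis.info --pipeport 3333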
This scenario requires the crawler to:
- crawl an entire target site
- crawl all the outlinks from the site
- crawl no other site or URL apart from
external URLs mentioned on the one target site
I.e. all pages on the target site, plus any other URL that is linked to
from a page on the target site.
- Configure Combine to crawl this one site only.
Change the <allow> part in
/etc/combine/XXX/combine.cfg so that it matches only the target site, as sketched below.
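A sketch, using the hypothetical target host www.example.org and the same HOST: pattern syntax as in the domain-restriction scenario above:
<allow>
HOST: www\.example\.org$
</allow>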
- Crawl until you have the entire site (if it is a big site you might want to make the changes
suggested in the FAQ).
- Stop crawling.
- Change configuration <allow> back to allow crawling
of any domain (which is the default).
- Schedule all links in the database for crawling (change XXX to your jobname); one way is sketched below.
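This sketch uses combineCtrl's link recycling; the action name recyclelinks is an assumption and should be checked against the combineCtrl documentation:
combineCtrl recyclelinks --jobname XXX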
- Change the configuration to disable automatic recycling of links, as sketched below,
and maybe (depending on your other requirements) adjust related configuration variables.
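A sketch of the relevant setting in /etc/combine/XXX/combine.cfg; the variable name AutoRecycleLinks is an assumption, so check the configuration documentation of your Combine version:
AutoRecycleLinks = 0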
- Start crawling and run until the queue is empty.