Performance evaluation of the automated subject classification component is treated in section 5. Performance in terms of the number of URLs processed per minute is of course highly dependent on a number of circumstances, such as network load, capacity of the machine, the selection of URLs to crawl, configuration details, the number of crawlers used, etc. In general, within rather wide limits, you can expect the Combine system to handle up to 200 URLs per minute. Handling here means everything from scheduling of URLs, fetching of pages over the network, parsing the page, automated subject classification, and recycling of new links, to storing the structured record in a relational database. This holds for everything from small, simple crawls started from scratch to large, complicated topic-specific crawls with millions of records.
The primary way of increasing performance is to use more than one
crawler for a job. This is handled by the -harvesters switch
used together with the combineCtrl start command. For example
combineCtrl -jobname MyCrawl -harvesters 5 start
will start 5 crawlers working together on the job 'MyCrawl'. The
effect of using more than one crawler on crawling speed is illustrated
in figure 4 below.
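
A minimal shell sketch of scaling a job up and shutting it down again is shown below. Only the start command with the -harvesters switch is documented above; the job name 'MyCrawl' is just an example, and the kill action used to stop the crawlers is an assumption that may differ in your installation.

  # Start five cooperating crawler processes for the job 'MyCrawl'
  combineCtrl -jobname MyCrawl -harvesters 5 start

  # Later, stop all crawlers for the job (the 'kill' action is an
  # assumption; consult your combineCtrl documentation)
  combineCtrl -jobname MyCrawl kill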
Configuration also has an effect on performance. In figure 5, performance improvements based on configuration changes are shown. The choice of algorithm for automated classification turns out to have the biggest influence on performance: algorithm 2 (classifyPlugIn = Combine::PosCheck_record - Pos in figure 5) is much faster than algorithm 1 (classifyPlugIn = Combine::Check_record - Std in figure 5). Tweaking other configuration variables also has an effect on performance, but to a lesser degree. Tweaking consisted of not using Tidy to clean HTML (useTidy = 0) and not storing the original page in the database (saveHTML = 0).
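
Collected in one place, the performance-oriented settings above might look like the sketch below in the job configuration, assuming the key = value syntax used by Combine configuration files; the file location given in the comment is an example and depends on the installation.

  # Example job configuration fragment (e.g. in the job's combine.cfg)
  # Use the faster position-based classification algorithm (Pos, algorithm 2)
  classifyPlugIn = Combine::PosCheck_record
  # Do not run Tidy to clean HTML before parsing
  useTidy = 0
  # Do not store the original page in the database
  saveHTML = 0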