That something is horribly wrong with the character encoding of this page.
That something is wrong with the character decoding of this page.
Put an appropriate regular expression in the <allow> section of the configuration file. Appropriate means a Perl regular expression, which means that you have to escape special characters. Try with
URL http:\/\/www\.foo\.com\/bar\/
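For instance, assuming you only want to allow pages under http://www.foo.com/bar/, the whole <allow> section could look like this sketch:

<allow>
#Allow crawl of URLs that match this regular expression
URL http:\/\/www\.foo\.com\/bar\/
</allow>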
Check that there are not two instances of the same simple configuration variable in the same configuration file; unfortunately, duplicates break configuration loading.
A match against any of the entries makes that URL allowable for crawling. You can use any mix of HOST: and URL entries.
Presently the crawler only accepts HTTP, HTTPS, and FTP as protocols.
Yes, it's one of the built-in limitations that keep the crawler 'nice'.
It will only access a particular server once every 60 seconds by default.
You can change the default by adjusting the following configuration variables, but please keep in mind that lowering them increases the load on the servers you crawl.
WaitIntervalSchedulerGetJcf=2
WaitIntervalHost = 5
Use the command:
combine --jobname XXX --harvesturl http://www.foo.com/bar.html
Initialize the database and load the seed pages. Turn off automatic recycling of links by setting the simple configuration variable 'AutoRecycleLinks' to 0.
Start crawling and stop when 'combineCtrl --jobname XXX howmany' reports 0.
Handle recycling manually with combineCtrl, using the action 'recyclelinks' (give the command 'combineCtrl --jobname XXX recyclelinks').
Iterate to the depth of your liking; a command sketch follows below.
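A minimal command sketch of this manual, depth-by-depth procedure, assuming a job named XXX, seed URLs in a file seeds.txt, AutoRecycleLinks set to 0, and that your installation provides the 'load' and 'start' actions of combineCtrl:

combineINIT --jobname XXX                   # initialize the job database (needs to be run as root)
combineCtrl --jobname XXX load < seeds.txt  # load the seed URLs (assumed action name)
combineCtrl --jobname XXX start             # start crawling (assumed action name)
combineCtrl --jobname XXX howmany           # poll until this reports 0
combineCtrl --jobname XXX recyclelinks      # schedule the links found so far (next depth)
combineCtrl --jobname XXX start             # crawl the next level, and so on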
You need to run combineINIT as root, due to file protection permissions.
They are stored in the SQL database <jobname> in the table log.
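If you want to inspect them directly, you can do so with the mysql command-line client; the user name below is just an assumption and depends on your MySQL setup:

mysql -u combine -p <jobname> -e 'SELECT * FROM log;'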
Std can handle Perl regular expressions in terms, but does not take into account whether a term is found at the beginning or at the end of the document. PosCheck can't handle Perl regular expressions, but is faster and takes word position and proximity into account.
For detailed descriptions see the sections Algorithm 1 and Algorithm 2.
40: sundew[^\s]*=CP.Drosera
40: tropical pitcher plant=CP.Nepenthes
It's part of the topic definition (term list) for the topic 'Carnivorous plants'. It's well described in the documentation; please see section 4.5.1. The strange characters are Perl regular expressions, mostly used for truncation etc.
So, to get all pages about 'icecream' from 'www.yahoo.com' you have to:
Put term-list entries like
100: icecream=YahooIce
100: ice cone=YahooIce
and so on in a file called, say, TopicYahooIce.txt.
Then change
#use either URL or HOST: (obs ':') to match regular expressions to either the
#full URL or the HOST part of a URL.
<allow>
#Allow crawl of URLs or hostnames that matches these regular expressions
HOST: .*$
</allow>
to
#use either URL or HOST: (obs ':') to match regular expressions to either the
#full URL or the HOST part of a URL.
<allow>
#Allow crawl of URLs or hostnames that matches these regular expressions
HOST: www\.yahoo\.com$
</allow>
This is just a way of telling the crawler where to start.
There are three things that are problematic:
If you take the source and look at how the tests (make test) are made, you might find a way to fix the first, though this probably involves modifying the source (maybe only Combine/Config.pm).
The second is not strictly necessary; it will run even if /var/run/combine does not exist, although the command 'combineCtrl --jobname XXX kill' will not.
On the other hand, the third is necessary, and I can't think of a way around it except making a local installation of MySQL and using that.
| 5409 | HARVPARS 1_zltest | 2006-07-14 15:08:52 | M500; SD empty, sleep 20 second... |
This means that there are no URLs ready for crawling (SD empty). You can also use combineCtrl to see the current status of the ready queue, etc.
| 7352 | HARVPARS 1_wctest | 2006-07-14 17:00:59 | M500; urlid=1; netlocid=1; http://www.shanghaidaily.com/
Crawler process 7352 got a URL (http://www.shanghaidaily.com/) to check (1_wctest is just a name and not significant). M500 is a sequence number for an individual crawler process: it starts at 500, and when it reaches 0 the crawler process is killed and a new one is created. urlid and netlocid are internal identifiers used in the MySQL tables.
| 7352 | HARVPARS 1_wctest | 2006-07-14 17:01:10 | M500; RobotRules OK, OK
The crawler process has checked that this URL (identified earlier in the log by pid=7352 and M500) can be crawled according to the Robots Exclusion Protocol.
| 7352 | HARVPARS 1_wctest | 2006-07-14 17:01:10 | M500; HTTP(200 = "OK") => OK
It has fetched the page (identified earlier in the log by pid=7352 and M500) OK.
| 7352 | HARVPARS 1_wctest | 2006-07-14 17:01:10 | M500; Doing: text/html;200;0F061033DAF69587170F8E285E950120;Not used |
It is processing the page (which is in the format text/html) to see if it is of topical interest. 0F061033DAF69587170F8E285E950120 is the MD5 checksum of the page.
You have to get into the raw MySQL database and perform a query like
SELECT urls.urlstr FROM urls,recordurl,topic WHERE urls.urlid=recordurl.urlid AND recordurl.recordid=topic.recordid AND topic.notation='CP.Aldrovanda';
The table urls contains all URLs seen by the crawler. The table recordurl connects urlid to recordid; recordid is used in all tables with data from the crawled Web pages.
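For example, to get a rough count of how many crawled records were assigned to each topic class, a query using only the topic table described above could look like this:

SELECT notation, COUNT(DISTINCT recordid) FROM topic GROUP BY notation;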
If you use multiple topic classes in your topic definition (i.e. the string after '='), then all the relevant topic scores for the page are summed and given the topic notation 'ALL'.
Just disregard it if you only use one topic class.
The crawler should stay within www.geocities.com/boulevard/newyork/, but not go outside the domain (i.e. to www.yahoo.com) and also not go higher up in the hierarchy (i.e. to www.geocities.com/boulevard/atlanta/).
Yes, change the <allow>-part of your configuration file combine.cfg to select what URLs should be allowed for crawling (by default everything is allowed). See also section 4.3.
So change
<allow>
#Allow crawl of URLs or hostnames that matches these regular expressions
HOST: .*$
</allow>
to something like
<allow>
#Allow crawl of URLs or hostnames that matches these regular expressions
URL http:\/\/www\.geocities\.com\/boulevard\/newyork\/
</allow>
(the backslashes are needed since these patterns are in fact Perl regular expressions)