

APPENDIX


Simple installation test

The following simple script is available in the file doc/InstallationTest.pl. It must be run as 'root' and tests that the basic functions of the Combine installation work.

Basically it creates and initializes a new jobname, crawls one specific test page and exports it as XML. This XML is then compared to a known-correct XML record for that page.

InstallationTest.pl

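A typical invocation, assuming the script is run from the top of the unpacked distribution (on an installed system it may instead be found under /usr/share/doc/combine/):

sudo perl doc/InstallationTest.pl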


Example topic filter plug in

This example gives more details on how to write a topic filter plug-in.

classifyPlugInTemplate.pm

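A rough sketch of the general shape such a plug-in can take, pieced together from the Combine::XWI methods documented later in this appendix. The subroutine name 'classify', its calling convention, the text accessor and the matched phrase are assumptions for illustration; the shipped classifyPlugInTemplate.pm is the authoritative template.

package classifyPlugInTemplate;

use strict;
use warnings;

# Called with an XWI record object for each fetched page (assumed convention).
# Returns 1 if the page is relevant to the topic, 0 otherwise (assumed convention).
sub classify {
    my ($xwi) = @_;

    my $text = $xwi->text || '';   # hypothetical accessor; XWI stores values via AUTOLOAD
    my $score = 0;
    $score++ while $text =~ /focused crawling/gi;   # toy stand-in for a real term matcher

    if ($score > 0) {
        # topic_add(Class, Absolute score, Normalized score, Terms, Algorithm id)
        # signature as documented in the Combine::XWI section below
        $xwi->topic_add('ALL', $score, 0, '', 'plugInTemplate');
        return 1;
    }
    return 0;
}

1;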

Default configuration files

Global


Job specific



SQL database

Create database
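A minimal sketch, assuming one database named after the jobname ('aatest' is illustrative) and the utf8 character set used by the tables below:

CREATE DATABASE aatest DEFAULT CHARACTER SET utf8;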




Creating MySQL tables

The tables used by a Combine job fall into two groups: data tables, which hold the crawled records, and administrative tables, which coordinate the crawl.
Data tables

CREATE TABLE recordurl (
  recordid int(11) NOT NULL auto_increment,
  urlid int(11) NOT NULL default '0',
  lastchecked timestamp NOT NULL,
  md5 char(32),
  fingerprint char(50),
  KEY md5 (md5),
  KEY fingerprint (fingerprint),
  PRIMARY KEY (urlid),
  KEY recordid (recordid)
) ENGINE=MyISAM DEFAULT CHARACTER SET=utf8;

Administrative tables



CREATE TABLE admin (
  status enum('closed','open','paused','stopped') default NULL,
  schedulealgorithm enum('default','bigdefault','advanced') default 'default',
  queid int(11) NOT NULL default '0'
) ENGINE=MEMORY DEFAULT CHARACTER SET=utf8;



CREATE TABLE log (
  pid int(11) NOT NULL default '0',
  id varchar(50) default NULL,
  date timestamp NOT NULL,
  message varchar(255) default NULL
) ENGINE=MyISAM DEFAULT CHARACTER SET=utf8;





Create user dbuser with required privileges
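A sketch of what this typically involves ('dbuser' is the user named above; the database name 'aatest', host and password are illustrative):

GRANT ALL PRIVILEGES ON aatest.* TO 'dbuser'@'localhost' IDENTIFIED BY 'secret';
FLUSH PRIVILEGES;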




Manual pages

combineExport


NAME

combineExport - export records in XML from Combine database


SYNOPSIS

combineExport -jobname $<$name$>$ [-profile alvis$\vert$dc$\vert$combine -charset utf8$\vert$isolatin -number $<$n$>$ -recordid $<$n$>$ -md5 $<$MD5$>$ -pipehost $<$server$>$ -pipeport $<$n$>$ -incremental ]


OPTIONS AND ARGUMENTS

jobname is used to find the appropriate configuration (mandatory)

-profile

Three profiles: alvis, dc, and combine. alvis and combine are similar XML formats.

The 'alvis' profile format is defined by the Alvis enriched document format DTD. It uses the charset UTF-8 by default.

'combine' is more compact with less redundancy.

'dc' is XML encoded Dublin Core data.

-charset

Selects a specific character set: UTF-8 or ISO-Latin-1. Overrides -profile settings.

-collapseinlinks

Skip inlinks with duplicate anchor-texts (ie just one inlink per unique anchor-text).

-nooutlinks

Do not include any outlinks in the exported records.

-ZebraIndex

ZebraIndex sends XML records directly to the Zebra server defined in Combine configuration variable 'ZebraHost'. It uses the default Zebra configuration: profile=combine, nooutlinks, collapseinlinks and is compatible with the direct Zebra indexing done during harvesting when 'ZebraHost' is defined in the Combine configuration. Requires that the Zebra server is running.

-xsltscript

Generates records in Combine native format and converts them using this XSLT script before output. See example scripts in /etc/combine/*.xsl

-number

The maximum number of records to export

-recordid

Export just the one record with this recordid

-md5

Export just the one record with this MD5 checksum

-pipehost, -pipeport

Specifies the server-name and port to connect to and export data using the Alvis Pipeline. Exports incrementally, ie all changes since last call to combineExport with the same pipehost and pipeport.

-incremental

Exports incrementally, ie all changes since last call to combineExport using -incremental


DESCRIPTION


EXAMPLES

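Illustrative invocations (jobname 'aatest' and the output file are examples; records are assumed to be written to standard output, as with the other Combine tools):

combineExport -jobname aatest -profile alvis > records.xml

Export all records for job aatest as Alvis XML.

combineExport -jobname aatest -recordid 12

Export the single record with recordid 12.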


SEE ALSO

Combine configuration documentation in /usr/share/doc/combine/.

Alvis XML schema (-profile alvis) at http://project.alvis.info/alvis_docs/enriched-document.xsd


AUTHOR

Anders Ardö, $<$anders.ardo@it.lth.se$>$


COPYRIGHT AND LICENSE

Copyright (C) 2005 - 2006 Anders Ardö

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.

$hits[\mbox{location}_{j}][\mbox{term}_{i}]$


combineCtrl


NAME

combineCtrl - controls a Combine crawling job


SYNOPSIS

combineCtrl $<$action$>$ -jobname $<$name$>$

where action can be one of start, kill, load, recyclelinks, reharvest, stat, howmany, records, hosts, initMemoryTables, open, stop, pause, continue


OPTIONS AND ARGUMENTS

jobname is used to find the appropriate configuration (mandatory)


Actions starting/killing crawlers
start

takes an optional switch -harvesters n where n is the number of crawler processes to start

kill

kills all active crawlers (and their associated combineRun monitors) for jobname


Actions loading or recycling URLs for crawling
load

Reads a list of URLs from STDIN (one per line) and schedules them for crawling

recyclelinks

Schedules all links newly found in crawled pages (since the last invocation of recyclelinks) for crawling

reharvest

Schedules all pages in the database for crawling again (in order to check if they have changed)


Actions for controlling scheduling of URLs
open

opens database for URL scheduling (maybe after a stop)

stop

stops URL scheduling

pause

pauses URL scheduling

continue

continues URL scheduling after a pause


Misc actions
stat

prints out rudimentary status of the ready queue (ie eligible now) of URLs to be crawled

howmany

prints out rudimentary status of all URLs to be crawled

records

prints out the number of records in the SQL database

hosts

prints out rudimentary status of all hosts that have URLs to be crawled

initMemoryTables

initializes the administrative MySQL tables that are kept in memory


DESCRIPTION

Implements various control functionality to administer a crawling job, like starting and stopping crawlers, injecting URLs into the crawl queue, scheduling newly found links for crawling, controlling scheduling, etc.

This is the preferred way of controlling a crawl job.


EXAMPLES

echo 'http://www.yourdomain.com/' $\vert$ combineCtrl load -jobname aatest

Seed the crawling job aatest with a URL

combineCtrl start -jobname aatest -harvesters 3

Start 3 crawling processes for job aatest

combineCtrl recyclelinks -jobname aatest

Schedule all newly found links for crawling

combineCtrl stat -jobname aatest

See how many URLs are eligible for crawling right now.


SEE ALSO

combine

Combine configuration documentation in /usr/share/doc/combine/.


AUTHOR

Anders Ardö, $<$anders.ardo@it.lth.se$>$


COPYRIGHT AND LICENSE

Copyright (C) 2005 Anders Ardö

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.

See the file LICENCE included in the distribution at http://combine.it.lth.se/


combineRun


NAME

combineRun - starts, monitors and restarts a combine harvesting process


SYNOPSIS

combineRun $<$pidfile$>$ $<$combine command to run$>$


DESCRIPTION

Starts a program and monitors it in order to make sure there is always a copy running. If the program dies it will be restarted with the same parameters. Used by combineCtrl when starting combine crawling.


SEE ALSO

combineCtrl


AUTHOR

Anders Ardö, $<$anders.ardo@it.lth.se$>$


COPYRIGHT AND LICENSE

Copyright (C) 2005 Anders Ardö

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.

See the file LICENCE included in the distribution at http://combine.it.lth.se/


combineReClassify


NAME

combineReClassify - main program that reanalyses records in a Combine database

Algorithm:

  select relevant records based on the cls parameter
  for each record:
    get the record from the database
    delete the analysis info from the record
    analyse the record
    if still relevant, save it in the database


combineSVM


NAME

combineSVM - generate an SVM model from good and bad examples


SYNOPSIS

combineSVM -jobname $<$name$>$ [-good $<$good-file$>$] [-bad $<$bad-file$>$] [-train $<$model-file$>$] [-help]


OPTIONS AND ARGUMENTS

jobname is used to find the appropriate configuration (mandatory)

good is the name of a file with good URLs, one per line. Default 'goodURL.txt'

bad is the name of a file with bad URLs, one per line. Default 'badURL.txt'

train is the name of the file where the trained SVM model will be stored. Default 'SVMmodel.txt'


DESCRIPTION

Takes two files, one with positive examples (good) and one with negative examples (bad), and trains an SVM classifier using these. The resulting model is stored in the file $<$train$>$.

The example files should contain one URL per line and nothing else.
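An illustrative run using the documented default file names (jobname 'aatest' is an example):

combineSVM -jobname aatest -good goodURL.txt -bad badURL.txt -train SVMmodel.txt

Train an SVM model from the two URL lists and store it in SVMmodel.txt.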


SEE ALSO

combine

Combine configuration documentation in /usr/share/doc/combine/.


AUTHOR

Ignacio Garcia Dorado Anders Ardö, $<$anders.ardo@it.lth.se$>$


COPYRIGHT AND LICENSE

Copyright (C) 2008 Ignacio Garcia Dorado, Anders Ardö

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.

See the file LICENCE included in the distribution at http://combine.it.lth.se/


combineRank


NAME

combineRank - calculates various Ranks for a Combine crawled database


SYNOPSIS

combineRank $<$action$>$ -jobname $<$name$>$ -verbose

where action can be one of PageRank, PageRankBL, NetLocRank, and exportLinkGraph. Results on STDOUT.


OPTIONS AND ARGUMENTS

jobname is used to find the appropriate configuration (mandatory)

verbose enables printing of ranks to STDOUT as SQL INSERT statements


Actions calculating variants of PageRank
PageRank

calculate standard PageRank

PageRankBL

calculate PageRanks with backlinks added for each link

NetLocRank

calculate SiteRank for each site and a local DocRank for documents within each site. Global ranks are then calculated as SiteRank * DocRank


Actions exporting link data
exportLinkGraph

export linkgraph from Combine database


DESCRIPTION

Implements calculation of different variants of PageRank.

Results are written to STDOUT and can be huge for large databases.

The linkgraph is exported in ASCII as a sparse matrix, one row per line. The first integer is the ID (urlid) of a page with links; the rest of the integers on the line are IDs of the pages linked to. Ie '121 5624 23416 51423 267178' means that page 121 links to pages 5624, 23416, 51423 and 267178.


EXAMPLES

combineRank -jobname aatest -verbose PageRankBL

calculate PageRank with backlinks, result on STDOUT

combineRank -jobname aatest -verbose exportLinkGraph

export the linkgraph to STDOUT


SEE ALSO

combine

Combine configuration documentation in /usr/share/doc/combine/.


AUTHOR

Anders Ardö, $<$anders.ardo@it.lth.se$>$


COPYRIGHT AND LICENSE

Copyright (C) 2006 Anders Ardö

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.

See the file LICENCE included in the distribution at http://combine.it.lth.se/


combineUtil


NAME

combineUtil - various operations on the Combine database


SYNOPSIS

combineUtil $<$action$>$ -jobname $<$name$>$

where action can be one of stats, termstat, classtat, sanity, all, serveralias, resetOAI, restoreSanity, deleteNetLoc, deletePath, deleteMD5, deleteRecordid, addAlias


OPTIONS AND ARGUMENTS

jobname is used to find the appropriate configuration (mandatory)


Actions listing statistics
stats

Global statistics about the database

termstat

generates statistics about the terms from the topic ontology matched in documents (can be long output)

classtat

generates statistics about the topic classes assigned to documents


Actions for sanity checks
sanity

Performs various sanity checks on the database

restoreSanity

Deletes records that the sanity checks find insane

resetOAI

Removes all history (ie 'deleted' records) from the OAI table. This is done by removing the OAI table and recreating it from the existing database.


Action all

Does the actions: stats, sanity, classtat, termstat


Actions for deleting records
deleteNetLoc

Deletes all records matching the ','-separated list of server net-locations (server-names optionally with port) in the switch -netlocstr. Net-locations can include SQL wild cards ('%').

deletePath

Deletes all records matching the ','-separated list of URL paths (excluding net-locations) in the switch -pathsubstr. Paths can include SQL wild cards ('%').

deleteMD5

Delete the record which has the MD5 in switch -md5

deleteRecordid

Delete the record which has the recordid in switch -recordid


Actions for handling server aliases
serverAlias

Detect server aliases in the current database and do an 'addAlias' on each detected alias.

addAlias

Manually add a serveralias to the system. Requires switches -aliases and -preferred


DESCRIPTION

Generates various statistics and performs sanity checks on the database


EXAMPLES

combineUtil termstat -jobname aatest

Generate matched term statistics
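A further illustrative example (the net-locations are hypothetical; '%' is the documented SQL wild card):

combineUtil deleteNetLoc -jobname aatest -netlocstr 'www.example.com,%.example.org'

Delete all records crawled from www.example.com and from any host under example.org.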


SEE ALSO

combine

Combine configuration documentation in /usr/share/doc/combine/.


AUTHOR

Anders Ardö, $<$anders.ardo@it.lth.se$>$


COPYRIGHT AND LICENSE

Copyright (C) 2005 Anders Ardö

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.

See the file LICENCE included in the distribution at http://combine.it.lth.se/


combine


NAME

Combine - Focused Web crawler framework


SYNOPSIS

combine -jobname $<$name$>$ -logname $<$id$>$


OPTIONS AND ARGUMENTS

jobname is used to find the appropriate configuration (mandatory)

logname is used as identifier in the log (in MySQL table log)


DESCRIPTION

Does crawling, parsing, optional topic-checking, and stores the results in a MySQL database. Normally started with the combineCtrl command. Briefly, it gets a URL from the MySQL database, which acts as a common coordinator for a Combine job. The Web page is fetched, provided it passes the robot exclusion protocol. The HTML is cleaned using Tidy and parsed into metadata, headings, text, links and link anchors. Then it is stored in the MySQL database in a structured form (optionally only if a topic-check is passed, to keep the crawler focused).

A simple workflow for a trivial crawl job might look like:

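A minimal sketch using the commands documented in this appendix (jobname 'aatest' and the URL are illustrative; combineExport is assumed to write to standard output):

# create and initialize the job (see the combineINIT manual page)
combineINIT -jobname aatest
# seed it with a start URL and start three crawlers
echo 'http://www.yourdomain.com/' | combineCtrl load -jobname aatest
combineCtrl start -jobname aatest -harvesters 3
# check progress now and then; when done, stop the crawlers and export
combineCtrl stat -jobname aatest
combineCtrl kill -jobname aatest
combineExport -jobname aatest > records.xml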

For more complex jobs you have to edit the job configuration file.


SEE ALSO

combineINIT, combineCtrl

Combine configuration documentation in /usr/share/doc/combine/.


AUTHOR

Anders Ardö, $<$anders.ardo@it.lth.se$>$


COPYRIGHT AND LICENSE

Copyright (C) 2005 Anders Ardö

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.

See the file LICENCE included in the distribution at http://combine.it.lth.se/


Combine::PosMatcher


NAME

PosMatcher


DESCRIPTION

This is a module in the DESIRE automatic classification system. Copyright 1999.

Exported routines:

1. Fetching text. These routines all extract texts from a document (either a Combine record, a Combine XWI datastructure or a WWW-page identified by a URL). They all return: $meta, $head, $text, $url, $title, $size

  $meta: Metadata from document
  $head: Important text from document
  $text: Plain text from document
  $url: URL of the document
  $title: HTML title of the document
  $size: The size of the document


2. Term matcher. Accepts a text as a (reference) parameter and matches each term in the termlist against the text. Matches are recorded in an associative array with class as key and summed weight as value.

Match parameters: $text, $termlist

  $text: text to match against the termlist
  $termlist: object pointer to a LoadTermList object with a termlist loaded

Output: %score: an associative array with classifications as keys and scores as values

3. Heuristics. Sums scores down the classification tree to the leaves.

cleanEiTree parameters: %res - an associative array from Match

Output: %res - the same array


AUTHOR

Anders Ardö, $<$anders.ardo@it.lth.se$>$


COPYRIGHT AND LICENSE

Copyright (C) 2005,2006 Anders Ardö

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.

See the file LICENCE included in the distribution at http://combine.it.lth.se/


Combine::selurl


NAME

selurl - Normalise and validate URIs for harvesting


INTRODUCTION

Selurl selects and normalises URIs on the basis of both general practice (hostname lowercasing, port number substitution etc.) and Combine-specific handling (applying config_allow, config_exclude, config_serveralias and other relevant config settings).

The Config settings catered for currently are:

maxUrlLength - the maximum length of an unnormalised URL
allow - Perl regular expressions identifying allowed URLs
exclude - Perl regular expressions to exclude URLs from harvesting
serveralias - aliases of server names
sessionids - list of sessionid markers to be removed

A selurl object can hold a single URL and has methods to obtain its subparts as defined in URI.pm, plus some methods to normalise and validate it in Combine context.


BUGS

Currently, the only schemes supported are http, https and ftp. Others may or may not work correctly. For one thing, we assume the scheme has an internet hostname/port.

clone() will only return a copy of the real URI object, not a new selurl.

URI URI-escapes the strings fed into it by new() once. Existing percent signs in the input are left untouched, which implies that:

(a) there is no risk of double-encoding; and

(b) if the original contained an inadvertent sequence that could be interpreted as an escape sequence, uri_unescape will not render the original input (e.g. url_with_%66_in_it goes whoop). If you know that the original has not yet been escaped and wish to safeguard potential percent signs, you'll have to escape them (and only them) once before you offer it to new().

A problem with URI is that its object is not a hash we can piggyback our data on, so I had to resort to AUTOLOAD to emulate inheritance. I find this ugly, but well, this *is* Perl, so what'd you expect?


Combine::XWI


NAME

XWI.pm - class for internal representation of a document record


SYNOPSIS

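A minimal usage sketch based on the methods documented below (the stored value and the URL are illustrative):

use Combine::XWI;

my $xwi = new Combine::XWI;

$xwi->XXX('My value');    # store a scalar via AUTOLOAD
my $t = $xwi->XXX;        # retrieve it again: $t is now 'My value'

$xwi->url_add('http://www.yourdomain.com/');
$xwi->url_rewind;
my $url = $xwi->url_get;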


DESCRIPTION

Provides methods for storing and retrieving structured records representing crawled documents.


METHODS


new()

XXX($val)

Saves $val using AUTOLOAD. It can later be retrieved by calling the same method with no arguments, as in the SYNOPSIS sketch above, which sets $t to 'My value'.


*_reset()

Forget all values.


*_rewind()

*_get will start with the first value.


*_add

stores values into the datastructure


*_get

retrieves values from the datastructure


meta_reset() / meta_rewind() / meta_add() / meta_get()

Stores the content of Meta-tags

Takes/Returns 2 parameters: Name, Content


xmeta_reset() / xmeta_rewind() / xmeta_add() / xmeta_get()

Extended information from Meta-tags. Not used.


url_remove() / url_reset() / url_rewind() / url_add() / url_get()

Stores all URLs (ie if multiple URLs for the same page) for this record

Takes/Returns 1 parameter: URL


heading_reset() / heading_rewind() / heading_add() / heading_get()

Stores headings from HTML documents

Takes/Returns 1 parameter: Heading text


link_reset() / link_rewind() / link_add() / link_get()

Stores links from documents

Takes/Returns 5 parameters: URL, netlocid, urlid, Anchor text, Link type


robot_reset() / robot_rewind() / robot_add() / robot_get()

Stores calculated information, like genre, language, etc

Takes/Returns 2 parameters: Name, Value. Both are strings, with max lengths Name: 15, Value: 20
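For example, a minimal sketch (the name/value pair is illustrative):

$xwi->robot_add('language', 'en');
$xwi->robot_rewind;
my ($name, $value) = $xwi->robot_get;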


topic_reset() / topic_rewind() / topic_add() / topic_get()

Stores result of topic classification.

Takes/Returns 5 parameters: Class, Absolute score, Normalized score, Terms, Algorithm id

Class, Terms, and Algorithm id are strings with max lengths Class: 50, and Algorithm id: 25

Absolute score, and Normalized score are integers

Normalized score and Terms are optional and may be replaced with 0, and '' respectively


SEE ALSO

Combine focused crawler main site http://combine.it.lth.se/


AUTHOR

Yong Cao $<$tsao@munin.ub2.lu.se$>$ v0.05 1997-03-13

Anders Ardö, $<$anders.ardo@it.lth.se$>$


COPYRIGHT AND LICENSE

Copyright (C) 2005,2006 Anders Ardö

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.

See the file LICENCE included in the distribution at http://combine.it.lth.se/


Combine::Matcher


NAME

Matcher


DESCRIPTION

This is a module in the DESIRE automatic classification system (Copyright 1999), modified in the ALVIS project (Copyright 2004).

Exported routines:

1. Fetching text. These routines all extract texts from a document (either a Combine XWI datastructure or a WWW-page identified by a URL). They all return: $meta, $head, $text, $url, $title, $size

  $meta: Metadata from document
  $head: Important text from document
  $text: Plain text from document
  $url: URL of the document
  $title: HTML title of the document
  $size: The size of the document


2. Term matcher. Accepts a text as a (reference) parameter and matches each term in the termlist against the text. Matches are recorded in an associative array with class as key and summed weight as value.

Match parameters: $text, $termlist

  $text: text to match against the termlist
  $termlist: object pointer to a LoadTermList object with a termlist loaded

Output: %score: an associative array with classifications as keys and scores as values


AUTHOR

Anders Ardö $<$anders.ardo@it.lth.se$>$


COPYRIGHT AND LICENSE

Copyright (C) 2005,2006 Anders Ardö

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.

See the file LICENCE included in the distribution at http://combine.it.lth.se/


Combine::FromTeX


NAME

Combine::FromTeX.pm - TeX parser in combine package


AUTHOR



Combine::SD_SQL


NAME

SD_SQL


DESCRIPTION

Reimplementation of sd.pl, SD.pm and SDQ.pm using MySQL; contains both the recyc and guard functionality.

The basic idea is to have a table (urldb) that contains most URLs ever inserted into the system, together with a lock (the guard function) and a boolean harvest-flag. Also in this table is the host part, together with its lock. URLs are selected from this table based on urllock, netloclock and harvest, and inserted into a queue (table que). URLs from this queue are then given out to harvesters.

The queue is implemented as follows. The admin table can be used to generate sequence numbers:

  mysql> update admin set queid=LAST_INSERT_ID(queid+1);

which are then used to extract the next URL from the queue:

  mysql> select host,url from que where queid=LAST_INSERT_ID();

When the queue is empty it is filled from table urldb. Several different algorithms can be used to fill it (round-robin, most URLs, longest time since harvest, ...). Since the harvest-flag and guard-lock are not updated until the actual harvest is done, it is OK to delete the queue and regenerate it at any time.

Questions, ideas, TODOs, etc.:

Split table urldb into 2 tables - one for URLs and one for hosts? Less efficient when filling que; more efficient when updating netloclock.

Data structure TABLE hosts:

  create table hosts(
    host varchar(50) not null default '',
    netloclock int not null,
    retries int not null default 0,
    ant int not null default 0,
    primary key (host),
    key (ant),
    key (netloclock)
  );

Handle too many retries?



AUTHOR

Anders Ardö $<$anders.ardo@it.lth.se$>$


COPYRIGHT AND LICENSE

Copyright (C) 2005,2006 Anders Ardö

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.

See the file LICENCE included in the distribution at http://combine.it.lth.se/


Combine::utilPlugIn


NAME

utilPlugIn


DESCRIPTION

Utilities for:

* extracting text from XWIs
* SVM classification
* language and country identification


AUTHOR

Ignacio Garcia Dorado, Anders Ardö $<$anders.ardo@eit.lth.se$>$


COPYRIGHT AND LICENSE

Copyright (C) 2008 Ignacio Garcia Dorado, Anders Ardö

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.

See the file LICENCE included in the distribution at http://combine.it.lth.se/


Combine::FromHTML


NAME

Combine::FromHTML.pm - HTML parser in combine package


AUTHOR

Yong Cao $<$tsao@munin.ub2.lu.se$>$ v0.06 1997-03-19

Anders Ardö 1998-07-18: added $<$AREA ... HREF=link ...$>$; fixed the $<$A ... HREF=link ...$>$ regexp to be more general.

Anders Ardö 2002-09-20: added 'a' as a tag not to be replaced with space; added removal of control chars and some punctuation marks from IP; added $<$style$>$...$<$/style$>$ as something to be removed before processing; beefed up compression of sequences of blanks to include $\backslash$240 (non-breakable space); changed 'remove head' before text extraction to handle multiline matching (which can be introduced by decoding HTML entities); added compress blanks and remove CRs to metadata-content.

Anders Ardö 2004-04: changed the extraction process dramatically.


Combine::RobotRules


NAME

RobotRules.pm


AUTHOR

Anders Ardö, version 1.0, 2004-02-19


Combine::HTMLExtractor


NAME

HTMLExtractor


DESCRIPTION

Adapted from HTML::LinkExtractor - Extract links from an HTML document - by D.H (PodMaster)


AUTHOR

Anders Ardö

D.H (PodMaster)


LICENSE

Copyright (c) 2003 by D.H. (PodMaster). All rights reserved.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. The LICENSE file contains the full text of the license.


Combine::LoadTermList


NAME

LoadTermList


DESCRIPTION

This is a module in the DESIRE automatic classification system. Copyright 1999.

LoadTermList - a class for loading and storing a stoplist with single words, and a termlist with classifications and weights



AUTHOR

Anders Ardö $<$Anders.Ardo@it.lth.se$>$


COPYRIGHT AND LICENSE

Copyright (C) 2005,2006 Anders Ardö

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.

See the file LICENCE included in the distribution at http://combine.it.lth.se/


Combine::classifySVM


NAME

classifySVM


DESCRIPTION

Classification plugin module using SVM (implementation SVMLight)

Uses SVM model loaded from file pointed to by configuration variable 'SVMmodel'


AUTHOR

Ignacio Garcia Dorado, Anders Ardö $<$anders.ardo@eit.lth.se$>$


COPYRIGHT AND LICENSE

Copyright (C) 2008 Ignacio Garcia Dorado, Anders Ardö

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.

See the file LICENCE included in the distribution at http://combine.it.lth.se/

