Documentation for the Combine (focused) crawling system
Anders Ardö, Koraljka Golub
November 18, 2008
Contents
I Overview
1 Introduction
2 Open source distribution, installation
  2.1 Installation
  2.2 Getting started
  2.3 Online documentation
  2.4 Use scenarios
3 Configuration
  3.1 Configuration files
4 Crawler internal operation
  4.1 URL selection criteria
  4.2 Document parsing and information extraction
  4.3 URL filtering
  4.4 Crawling strategy
  4.5 Built-in topic filter – automated subject classification using string matching
  4.6 Built-in topic filter – automated subject classification using SVM
  4.7 Topic filter Plug-In API
  4.8 Analysis
  4.9 Duplicate detection
  4.10 URL recycling
  4.11 Database cleaning
  4.12 Complete application – SearchEngine in a Box
5 Evaluation of automated subject classification
  5.1 Approaches to automated classification
  5.2 Evaluation methodology
  5.3 Results
6 Performance and scalability
  6.1 Speed
  6.2 Space
  6.3 Crawling strategy
7 System components
  7.1 combineINIT
  7.2 combineCtrl
  7.3 combineUtil
  7.4 combineExport
  7.5 Internal executables and Library modules
References
II Gory details
8 Frequently asked questions
9 Configuration variables
  9.1 Name/value configuration variables
  9.2 Complex configuration variables
10 Module dependences
  10.1 Programs
  10.2 Library modules
  10.3 External modules
III
A APPENDIX
  A.1 Simple installation test
  A.2 Example topic filter plug in
  A.3 Default configuration files
  A.4 SQL database
  A.5 Manual pages