Documentation for the Combine (focused) crawling system
Anders Ardö, Koraljka Golub
Contents
Overview
Introduction
Open source distribution, installation
Installation
Getting started
Online documentation
Use scenarios
Configuration
Configuration files
Crawler internal operation
URL selection criteria
Document parsing and information extraction
URL filtering
Crawling strategy
Built-in topic filter - automated subject classification using string matching
Built-in topic filter - automated subject classification using SVM
Topic filter Plug-In API
Analysis
Duplicate detection
URL recycling
Database cleaning
Complete application - SearchEngine in a Box
Evaluation of automated subject classification
Approaches to automated classification
Evaluation methodology
Results
Performance and scalability
Speed
Space
Crawling strategy
System components
combineINIT
combineCtrl
combineUtil
combineExport
Internal executables and Library modules
Bibliography
Gory details
Frequently asked questions
Configuration variables
Name/value configuration variables
Complex configuration variables
Module dependences
Programs
Library modules
External modules
APPENDIX
Simple installation test
Example topic filter plug in
Default configuration files
SQL database
Manual pages
About this document ...
root 2008-10-13