Overview
Subsections
Introduction
Open source distribution, installation
Installation
Installation from source for the impatient
Porting to unsupported operating systems - dependencies
Automated Debian/Ubuntu installation
Manual installation
Out-of-the-box installation test
Getting started
Online documentation
Use scenarios
General crawling without restrictions
Focused crawling - domain restrictions
Focused crawling - topic-specific
Focused crawling in an Alvis system
Crawl one entire site and its outlinks
Configuration
Configuration files
Templates
Global configuration files
Job specific configuration files
Details and default values
Crawler internal operation
URL selection criteria
Document parsing and analysis
URL filtering
Crawling strategy
Built-in topic filter - automated subject classification
Topic definition
Topic definition (term triplets) BNF grammar
Term triplet examples
Algorithm 1: plain matching
Algorithm 2: position weighted matching
Topic filter Plug-In API
Analysis
Duplicate detection
URL recycling
Database cleaning
Complete application - SearchEngine in a Box
Evaluation of automated subject classification
Approaches to automated classification
Description of the string-matching algorithm used
Evaluation methodology
Evaluation challenge
Evaluation measures used
Data collection
Results
The role of different thesaurus terms
Enriching the term list using natural language processing
Importance of HTML structural elements and metadata
Challenges and recommendations for classification of Web pages
Comparing and combining two approaches
Performance and scalability
Speed
Space
Crawling strategy
System components
combineINIT
combineCtrl
combineUtil
combineExport
Internal executables and library modules
Library