Part I Overview
1 Introduction
2 Open source distribution, installation
  2.1 Installation
  2.2 Getting started
  2.3 Online documentation
  2.4 Use scenarios
3 Configuration
  3.1 Configuration files
4 Crawler internal operation
  4.1 URL selection criteria
  4.2 Document parsing and information extraction
  4.3 URL filtering
  4.4 Crawling strategy
  4.5 Built-in topic filter – automated subject classification using string matching
  4.6 Built-in topic filter – automated subject classification using SVM
  4.7 Topic filter Plug-In API
  4.8 Analysis
  4.9 Duplicate detection
  4.10 URL recycling
  4.11 Database cleaning
  4.12 Complete application – SearchEngine in a Box
5 Evaluation of automated subject classification
  5.1 Approaches to automated classification
  5.2 Evaluation methodology
  5.3 Results
6 Performance and scalability
  6.1 Speed
  6.2 Space
  6.3 Crawling strategy
7 System components
  7.1 combineINIT
  7.2 combineCtrl
  7.3 combineUtil
  7.4 combineExport
  7.5 Internal executables and Library modules
References