SAP NetWeaver '04

Package com.sapportals.wcm.service.crawler

Provides a service that crawls repositories to obtain references to resources.

See:
          Description

Interface Summary
ICrawler Deprecated. as of NW04.
ICrawlerList Deprecated. as of NW04.
ICrawlerListIterator Deprecated. as of NW04.
ICrawlerListResultReceiver Deprecated. as of NW04.
ICrawlerProfile Deprecated. as of NW04.
ICrawlerProfileList Deprecated. as of NW04.
ICrawlerProfileListIterator Deprecated. as of NW04.
ICrawlerPushedDeltaResultReceiver Deprecated. as of NW04.
ICrawlerPushedResultReceiver Deprecated. as of NW04.
ICrawlerQueue Deprecated. as of NW04.
ICrawlerResultReceiver Deprecated. as of NW04.
ICrawlerResultReceiverExtension Deprecated. as of NW04.
ICrawlerService Deprecated. as of NW04.
ICrawlerStatistics Deprecated. as of NW04.
ICrawlerVisitedEntry Deprecated. as of NW04.
ICrawlerVisitedEntryIterator Deprecated. as of NW04.
ICrawlerVisitedEntryList Deprecated. as of NW04.
ICrawlerVisitedEntryListIterator Deprecated. as of NW04.
ICrawlerVisitedList Deprecated. as of NW04.
IScheduledCrawler Deprecated. as of NW04.
IScheduledCrawlerList Deprecated. as of NW04.
IScheduledCrawlerListIterator Deprecated. as of NW04.
ISpecialCrawler Deprecated. as of NW04.
 

Class Summary
AbstractListResultReceiver Deprecated. as of NW04.
AbstractPushedResultReceiver Deprecated. as of NW04.
CrawlerUtils Deprecated. as of NW04.
ICrawlerComparator Deprecated. as of NW04.
 

Exception Summary
CrawlerRunningException Deprecated. as of NW04.
 

Package com.sapportals.wcm.service.crawler Description

Provides a service that crawls repositories to obtain references to resources.

Package Specification

Purpose
Detailed_Concept
Interfaces_and_Classes
Code_Samples
Configuration
Related_Documentation

Purpose

The crawler service provides functions to create and manage crawlers. Crawlers are used to determine all the resources contained in a Content Management (CM) repository and to obtain references to them. The behavior of crawlers can be controlled in various ways. They can be instructed to find resources that match certain conditions.

Various applications use the crawler. For example, the CM indexing service uses the crawler in a preparatory step to build indexes for search and classification operations. It calls the crawler to get references to all the resources in a directory. It then passes the references to a search engine which uses them to access and analyse the target resources and to build the index. Similarly, the CM subscription service also makes use of the crawler. It schedules the crawler to find out the contents of directories at regular intervals. On the basis of the information returned, it can determine whether any objects in the directories have changed since the last crawl.

Detailed Concept

The core task of the crawler is simply to collect references to all the objects in a repository and to pass these on for processing. In order to perform this task, it needs the support of the repository manager to retrieve the objects and the assistance of a special object, a results receiver, to accept and process the objects. In this respect the CM crawler differs from other crawler implementations, which generally combine the retrieval, collection, and result processing functions in a single object.

The following aims to give you a technical understanding of the CM crawler. It describes how the crawler operates in different repository types and how it interacts with the repository manager and the results receiver to perform its task. It also introduces the different types of crawler that are implemented and points out features that allow you to control their behavior.

Crawling Techniques

The crawler service offers two basic crawling techniques: one for hierarchical repositories and one for web repositories. In a hierarchical repository, the crawler basically has to find the children of all resources; in a web repository, it has to find all linked resources. The graphic illustrates these two procedures:

Crawling techniques

Crawling a Hierarchy

When the crawler works its way through a hierarchical repository, it needs to know a start resource before it can begin operating. Starting from there, it repeatedly applies the getChildren method to each collection it encounters. In this way it accesses all resources until the hierarchy is exhausted. The resources are passed on to a results receiver object for processing.
The crawler is assisted by a repository manager which retrieves the resources from the corresponding repository whenever the getChildren method is executed.
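This hierarchical walk can be sketched as a simple recursion. The following self-contained example models the repository with a hypothetical Node class (not the actual CM API); its children list stands in for the getChildren call, and a plain list stands in for the results receiver:

```java
import java.util.*;

// Minimal model of a hierarchical repository: a node is a resource whose
// children (empty for non-collections) stand in for a getChildren call.
class Node {
    final String rid;            // resource ID
    final List<Node> children;
    Node(String rid, Node... kids) { this.rid = rid; this.children = List.of(kids); }
}

class HierarchyCrawl {
    // Depth-first walk: hand the resource to the results receiver, then
    // recurse into each child returned by getChildren.
    static void crawl(Node resource, List<String> receiver) {
        receiver.add(resource.rid);
        for (Node child : resource.children) crawl(child, receiver);
    }

    public static void main(String[] args) {
        Node root = new Node("/docs",
                new Node("/docs/a", new Node("/docs/a/1"), new Node("/docs/a/2")),
                new Node("/docs/b"));
        List<String> found = new ArrayList<>();
        crawl(root, found);
        System.out.println(found); // → [/docs, /docs/a, /docs/a/1, /docs/a/2, /docs/b]
    }
}
```

In the real service, the repository manager performs the retrieval behind getChildren, and the receiver is an implementation of one of the results receiver interfaces listed above.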

Crawling a Web Site

When the crawler works its way through a web repository, it also needs to know a start resource; it then follows all the links leading from there. The crawler cannot recognize the links within an HTML page itself and relies on the support of the repository manager to identify them.
In summary, the crawler retrieves a page, has the repository manager extract its links, passes the page on to the results receiver, and then follows the extracted links in turn.

Crawler Types

Content Management has implemented three variants of the crawler:

The standard crawler can work through both hierarchical and web repositories, but it is better suited to hierarchical repositories; by default it is set to crawl hierarchical repositories. When it crawls a repository it uses a depth-first strategy: it starts at the top left node, works its way down to the bottom of the hierarchy, and then continues with the second-to-left top node.

The web crawler can also work through both hierarchical and web repositories, but it is better suited to web repositories; by default it is set to crawl web repositories. When it crawls it uses a breadth-first approach: it works through each level of links sequentially, first following all links on the first level, then all links on the second level, and so on.
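The breadth-first approach can be sketched with a queue and a visited set. The example below is self-contained: the LINKS map is a hypothetical stand-in for the repository manager's link extraction, not the actual API:

```java
import java.util.*;

class WebCrawl {
    // Stand-in for the repository manager's link extraction: maps a page
    // URL to the links found on that page (hypothetical sample data).
    static final Map<String, List<String>> LINKS = Map.of(
            "http://host/start", List.of("http://host/a", "http://host/b"),
            "http://host/a",     List.of("http://host/c", "http://host/start"),
            "http://host/b",     List.of("http://host/c"),
            "http://host/c",     List.of());

    // Breadth-first crawl: process all links on one level before descending,
    // using a visited set so each page is reported only once.
    static List<String> crawl(String start) {
        List<String> result = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        visited.add(start);
        while (!queue.isEmpty()) {
            String url = queue.removeFirst();
            result.add(url);                                   // hand to results receiver
            for (String link : LINKS.getOrDefault(url, List.of())) {
                if (visited.add(link)) queue.addLast(link);    // enqueue unseen links only
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(crawl("http://host/start"));
        // → [http://host/start, http://host/a, http://host/b, http://host/c]
    }
}
```

The visited set also illustrates why a visited list (described below) matters for web crawls: without it, cyclic links such as a → start would never terminate.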

The threaded crawler is identical to the web crawler except that it can execute retrieval and provider processes in parallel using threads. The retriever part of the crawler is responsible for getting the URLs of resources and transferring them to a list. The provider part is responsible for passing the URLs on to a results receiver object that handles their processing.
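The retriever/provider split is a classic producer-consumer arrangement, which can be sketched with a bounded queue and two threads. This is a hypothetical simplification; the queue bound corresponds to the foundcapacity parameter described under Configuration:

```java
import java.util.*;
import java.util.concurrent.*;

class ThreadedCrawl {
    // Crawl n resources with one retriever and one provider thread;
    // returns the entries handed to the results receiver.
    static List<String> run(int n) throws InterruptedException {
        BlockingQueue<String> found = new ArrayBlockingQueue<>(100); // cf. foundcapacity
        final String POISON = "__done__";                            // shutdown marker
        List<String> processed = Collections.synchronizedList(new ArrayList<>());

        // Retriever: discovers resources and puts their URLs into the
        // queue as found entries.
        Thread retriever = new Thread(() -> {
            try {
                for (int i = 0; i < n; i++) found.put("http://host/page" + i);
                found.put(POISON);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        // Provider: takes found entries off the queue and passes them on
        // to the results receiver (here just a list).
        Thread provider = new Thread(() -> {
            try {
                String url;
                while (!(url = found.take()).equals(POISON)) processed.add(url);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        retriever.start();
        provider.start();
        retriever.join();
        provider.join();
        return processed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(5).size() + " entries processed"); // → 5 entries processed
    }
}
```

The real threaded crawler runs several retriever and provider threads (the <ID>.retrievers and <ID>.providers parameters); the single-thread pair above just shows the hand-over through the queue.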

Crawler Features

In addition to enabling and optimizing the three basic crawler types described above, the crawler service offers a number of useful functions that influence the behavior of the crawler. For example, it can remember which resources have been visited, return only changed resources (delta crawling), and queue crawlers for scheduled execution, as described below.

Visited List

The crawler can be instructed to remember information about each resource it has crawled. This information is useful for avoiding recrawls of already visited resources and for determining whether a resource has changed since the last crawl. The list of crawled resources can be stored in different ways, for example in memory or in a database (see the <ID>.visitedlist parameter in the Configuration section).
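An in-memory visited list can be sketched as a map from resource ID to the modification time recorded at the last visit. This is a hypothetical simplification in the spirit of the VisitedListMemory option, not the actual implementation:

```java
import java.util.*;

// Minimal in-memory visited list: records each crawled resource together
// with its modification time so that a later crawl can skip unchanged
// resources or detect changed ones.
class VisitedList {
    private final Map<String, Long> entries = new HashMap<>();

    // Records the resource; returns true if it was not visited before.
    boolean markVisited(String rid, long lastModified) {
        return entries.put(rid, lastModified) == null;
    }

    // True if the resource is unknown or its modification time differs
    // from the one recorded at the last visit.
    boolean changedSince(String rid, long lastModified) {
        Long recorded = entries.get(rid);
        return recorded == null || recorded != lastModified;
    }
}
```

A database-backed variant would persist the same information (cf. the VisitedListDatabase value and the <ID>.tablename parameter below), which is what makes delta crawling across service restarts possible.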

Delta Crawling

Crawlers can be used to perform a delta crawl, in which case they return only references to resources that have changed since the last crawl. A prerequisite for a delta crawl is that the list of visited resources is stored in a database. An ICrawlerPushedDeltaResultReceiver must also be implemented. The delta crawler can analyse different types of data to determine whether resources have changed; the alternatives (no checksum, a CRC checksum, or an ETag) can be selected with the <ID>.checksum parameter described in the Configuration section.
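A checksum-based delta check can be sketched as follows. This is a hypothetical simplification in the spirit of the CheckSumCRC option, not the actual implementation:

```java
import java.util.*;
import java.util.zip.CRC32;

// Delta check via content checksums: a CRC32 checksum of each resource's
// content is stored; on the next crawl only resources whose checksum
// differs are reported as changed.
class DeltaCheck {
    private final Map<String, Long> storedChecksums = new HashMap<>();

    static long crc(byte[] content) {
        CRC32 crc = new CRC32();
        crc.update(content);
        return crc.getValue();
    }

    // True if the resource is new or its content changed since the last
    // crawl; the stored checksum is updated either way.
    boolean isChanged(String rid, byte[] content) {
        long sum = crc(content);
        Long previous = storedChecksums.put(rid, sum);
        return previous == null || previous != sum;
    }
}
```

An ETag-based variant (cf. CheckSumETag) would compare the HTTP ETag header instead of hashing the content, avoiding a full download when the server supports it.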

Queuing

The crawler usually has to work through a large amount of distributed data, so it is often necessary to run many crawlers simultaneously. As this can result in an excessive system load, crawlers can be registered in an ICrawlerQueue and then started in groups at scheduled times. The crawler queue is used in combination with the scheduler service.

Interfaces and Classes

The crawler service offers a number of interfaces that you can use to work with crawlers. The most important are ICrawlerService, ICrawler, and the various results receiver interfaces.

The graphic illustrates the central relationships between ICrawlerService, ICrawler and other interfaces.

The graphic illustrates the relationship between the ICrawler and the various results receiver interfaces.
Code Samples

If you want to use the crawler in an application, in summary you need to do the following:

  1. Get an instance of the crawler service.
  2. Create a crawler with one of the create methods offered by the crawler service. You need to provide a start resource and generally also an object to receive and process the results collected by the crawler. The object to receive the crawling results can be, for example, an implementation of ICrawlerResultReceiver or ICrawlerListResultReceiver.
  3. Start the crawler.
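The three steps can be sketched as follows. Since the actual SAP signatures are not reproduced here, the example uses simplified, hypothetical stand-in interfaces; only the create-then-start flow mirrors the steps above:

```java
import java.util.*;

// Hypothetical stand-ins for the service interfaces; the shapes below are
// simplified and are NOT the actual SAP signatures.
interface ResultReceiver { void receive(String rid); }
interface Crawler { void start(); }

class SimpleCrawlerService {
    // Step 2: create a crawler for a start resource and a results receiver.
    Crawler createCrawler(String startResource, ResultReceiver receiver) {
        // A real crawler would walk the repository from startResource;
        // this stub only reports the start resource itself.
        return () -> receiver.receive(startResource);
    }
}

class CrawlerClient {
    static List<String> crawlOnce(String startResource) {
        // Step 1: obtain the crawler service (in CM this would come from
        // the runtime environment rather than a direct constructor call).
        SimpleCrawlerService service = new SimpleCrawlerService();

        List<String> results = new ArrayList<>();

        // Step 2: create the crawler with a start resource and a receiver.
        Crawler crawler = service.createCrawler(startResource, results::add);

        // Step 3: start the crawler.
        crawler.start();
        return results;
    }

    public static void main(String[] args) {
        System.out.println(crawlOnce("/documents")); // → [/documents]
    }
}
```

In a real application the results receiver would typically be a subclass of AbstractListResultReceiver or AbstractPushedResultReceiver rather than a plain list.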

Configuration

The behavior of the crawler is influenced by parameters that are grouped into two categories: crawler parameters and profile parameters.
Crawler parameters are generally only valid for a specific crawler type, such as the threaded crawler, whereas profile parameters are valid for all crawler types. Profile parameters are stored in an IProfile object.

Before you can use a crawler, you must ensure that it is configured correctly. Keep in mind that only repositories that have been defined within Content Management can be crawled.

The parameters shown in the table can be used to configure the crawler.
The parameters can be set with the help of the user interface of the Content Management configuration framework.
For more information, see the documentation Administering Content Management.

The parameters in the table that are not required have default values that are used if no value is specified.

Parameter Required Description
class yes The class that is used for this system as the crawler service, usually com.sapportals.wcm.service.crawler.wcm.CrawlerService.
expiration no The expiration time for the crawler statistics in seconds (this determines how long crawlers are kept in the list displayed in the crawler administration control).
Default: 1800.
list yes A comma-separated list of IDs for the available crawlers.
<ID>.class yes The class that is to be used for the crawler with the specified ID.
Possible values are com.sapportals.wcm.service.crawler.wcm.CrawlerTypeStandard, com.sapportals.wcm.service.crawler.wcm.CrawlerTypeWeb and com.sapportals.wcm.service.crawler.wcm.CrawlerTypeThreaded.
<ID>.checksum no The class that handles the checksum calculation for the (delta) crawler with the given ID.
Possible values are com.sapportals.wcm.service.crawler.wcm.CheckSumUnused (default), com.sapportals.wcm.service.crawler.wcm.CheckSumCRC and com.sapportals.wcm.service.crawler.wcm.CheckSumETag.
(This parameter should be set to com.sapportals.wcm.service.crawler.wcm.CheckSumUnused if <ID>.visitedlist is not com.sapportals.wcm.service.crawler.wcm.VisitedListDatabase.)
<ID>.visitedlist no The class that is to be used to store the visited entries for the crawler with the given ID.
Possible values are com.sapportals.wcm.service.crawler.wcm.VisitedListUnused, com.sapportals.wcm.service.crawler.wcm.VisitedListMemory (default) and com.sapportals.wcm.service.crawler.wcm.VisitedListDatabase.
<ID>.poolid no The ID of the connection to use for accessing the database.
Default: dbcon_rep.
(This parameter is only used if <ID>.visitedlist = com.sapportals.wcm.service.crawler.wcm.VisitedListDatabase.)
<ID>.tablename no The name of the database table to store the visited entries in.
Default: wcm_crawler.
(This parameter is only used if <ID>.visitedlist = com.sapportals.wcm.service.crawler.wcm.VisitedListDatabase.)
<ID>.foundcapacity no The limit for the number of entries in the queue (-1 for unlimited).
Default: -1.
<ID>.retrievers no The number of threads that retrieve resources and put them into the queue as found entries.
Default: 8.
(This parameter is only used if <ID>.class = com.sapportals.wcm.service.crawler.wcm.CrawlerTypeThreaded.)
<ID>.providers no The number of threads that process found entries.
Default: 8.
(This parameter is only used if <ID>.class = com.sapportals.wcm.service.crawler.wcm.CrawlerTypeThreaded.)

The following is a sample configuration entry:

service.crawler.class = com.sapportals.wcm.service.crawler.wcm.CrawlerService
service.crawler.expiration = 900
service.crawler.list = standard, web, threaded

service.crawler.standard.class = com.sapportals.wcm.service.crawler.wcm.CrawlerTypeStandard
service.crawler.standard.checksum = com.sapportals.wcm.service.crawler.wcm.CheckSumUnused
service.crawler.standard.visitedlist = com.sapportals.wcm.service.crawler.wcm.VisitedListMemory
service.crawler.standard.foundcapacity = -1

service.crawler.web.class = com.sapportals.wcm.service.crawler.wcm.CrawlerTypeWeb
service.crawler.web.checksum = com.sapportals.wcm.service.crawler.wcm.CheckSumETag
service.crawler.web.visitedlist = com.sapportals.wcm.service.crawler.wcm.VisitedListDatabase
service.crawler.web.poolid = dbcon_webcrawl
service.crawler.web.foundcapacity = -1

service.crawler.threaded.class = com.sapportals.wcm.service.crawler.wcm.CrawlerTypeThreaded
service.crawler.threaded.checksum = com.sapportals.wcm.service.crawler.wcm.CheckSumCRC
service.crawler.threaded.visitedlist = com.sapportals.wcm.service.crawler.wcm.VisitedListDatabase
service.crawler.threaded.poolid = dbcon_webcrawl
service.crawler.threaded.foundcapacity = -1
service.crawler.threaded.retrievers = 10
service.crawler.threaded.providers = 3

Implementation Notes

Crawler Limitations
The crawling process described above makes it clear that the crawler, as it is implemented in the Content Management environment, has limited capabilities and relies on the repository manager and the results receiver object to be able to do its work. Essentially the crawler only collects the objects. They are retrieved by the repository manager and processed by a results receiver object.

Note that the crawler can only work through static pages. It cannot, for example, work through pages that include JavaScript. It also does not yet recognize the entries in a robots exclusion file assigned to an HTML file.

The sequence diagram shows the process and objects involved when a client uses a crawler.

Related Documentation

com.sapportals.wcm.service.scheduler



Copyright © 2004 by SAP AG. All Rights Reserved.
SAP, R/3, mySAP, mySAP.com, xApps, xApp, SAP NetWeaver, and other SAP products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of SAP AG in Germany and in several other countries all over the world. All other product and service names mentioned are the trademarks of their respective companies. Data contained in this document serves informational purposes only. National product specifications may vary.

These materials are subject to change without notice. These materials are provided by SAP AG and its affiliated companies ("SAP Group") for informational purposes only, without representation or warranty of any kind, and SAP Group shall not be liable for errors or omissions with respect to the materials. The only warranties for SAP Group products and services are those that are set forth in the express warranty statements accompanying such products and services, if any. Nothing herein should be construed as constituting an additional warranty.