SAP NetWeaver '04 | |||||||||
| Interface Summary | |
| ICrawler | Deprecated as of NW04. |
| ICrawlerList | Deprecated as of NW04. |
| ICrawlerListIterator | Deprecated as of NW04. |
| ICrawlerListResultReceiver | Deprecated as of NW04. |
| ICrawlerProfile | Deprecated as of NW04. |
| ICrawlerProfileList | Deprecated as of NW04. |
| ICrawlerProfileListIterator | Deprecated as of NW04. |
| ICrawlerPushedDeltaResultReceiver | Deprecated as of NW04. |
| ICrawlerPushedResultReceiver | Deprecated as of NW04. |
| ICrawlerQueue | Deprecated as of NW04. |
| ICrawlerResultReceiver | Deprecated as of NW04. |
| ICrawlerResultReceiverExtension | Deprecated as of NW04. |
| ICrawlerService | Deprecated as of NW04. |
| ICrawlerStatistics | Deprecated as of NW04. |
| ICrawlerVisitedEntry | Deprecated as of NW04. |
| ICrawlerVisitedEntryIterator | Deprecated as of NW04. |
| ICrawlerVisitedEntryList | Deprecated as of NW04. |
| ICrawlerVisitedEntryListIterator | Deprecated as of NW04. |
| ICrawlerVisitedList | Deprecated as of NW04. |
| IScheduledCrawler | Deprecated as of NW04. |
| IScheduledCrawlerList | Deprecated as of NW04. |
| IScheduledCrawlerListIterator | Deprecated as of NW04. |
| ISpecialCrawler | Deprecated as of NW04. |
| Class Summary | |
| AbstractListResultReceiver | Deprecated as of NW04. |
| AbstractPushedResultReceiver | Deprecated as of NW04. |
| CrawlerUtils | Deprecated as of NW04. |
| ICrawlerComparator | Deprecated as of NW04. |
| Exception Summary | |
| CrawlerRunningException | Deprecated as of NW04. |
Provides a service that crawls repositories to obtain references to resources.
Contents:
- Purpose
- Detailed Concept
- Interfaces and Classes
- Code Samples
- Configuration
- Related Documentation
The crawler service provides functions to create and manage crawlers. Crawlers are used to determine all the resources contained in a Content Management (CM) repository and to obtain references to them. The behavior of crawlers can be controlled in various ways. They can be instructed to find resources that match certain conditions.
Various applications use the crawler. For example, the CM indexing service uses the crawler in a preparatory step to build indexes for search and classification operations. It calls the crawler to get references to all the resources in a directory. It then passes the references to a search engine which uses them to access and analyse the target resources and to build the index. Similarly, the CM subscription service also makes use of the crawler. It schedules the crawler to find out the contents of directories at regular intervals. On the basis of the information returned, it can determine whether any objects in the directories have changed since the last crawl.
The core task of the crawler is simply to collect references to all the objects in a repository and to pass these on for processing. To perform this task, it needs the support of the repository manager to retrieve the objects and the assistance of a special object, a results receiver, to accept and process them. In this respect the CM crawler differs from other crawler implementations, which generally combine the retrieval, collection, and result processing functions in a single object.
The following aims to give you a technical understanding of the CM crawler. It describes how the crawler operates in different repository types and how it interacts with the repository manager and the results receiver to perform its task. It also introduces the different types of crawler that are implemented and points out features that allow you to control their behavior.
The crawler service offers two basic crawling techniques: one for hierarchical repositories and one for web repositories. In a hierarchical repository, the crawler essentially has to find the children of all resources; in a web repository, it has to find all linked resources. The graphic illustrates these two procedures:
When the crawler works its way through a hierarchical repository, it needs to know a start resource before it can begin operating. Once the start resource is set, it simply applies the getChildren method repeatedly to each collection. In this way, it is able to access all resources until the hierarchy is exhausted. The resources are passed on to a results receiver object for processing.
The crawler is assisted by a repository manager, which retrieves the resources from the corresponding repository whenever the getChildren method is executed.
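As a rough sketch of this traversal, the following self-contained Java example models the depth-first getChildren loop. Note that Resource, getChildren, and ResultsReceiver here are simplified stand-ins invented for this illustration; they are not the actual CM interfaces.

```java
import java.util.ArrayList;
import java.util.List;

public class HierarchicalCrawlSketch {

    // Simplified stand-in for a CM resource; a real repository resource
    // is far richer and is served by the repository manager.
    static class Resource {
        final String rid;                        // resource ID (path)
        final List<Resource> children = new ArrayList<>();
        Resource(String rid) { this.rid = rid; }
        Resource add(String childRid) {
            Resource c = new Resource(childRid);
            children.add(c);
            return c;
        }
        // In CM the repository manager answers this call against the backend.
        List<Resource> getChildren() { return children; }
    }

    // Stand-in for a pushed results receiver: accepts one reference at a time.
    interface ResultsReceiver { void receive(String rid); }

    // Depth-first crawl: hand over a resource, then recurse into each child
    // before moving on to the next sibling.
    static void crawl(Resource start, ResultsReceiver receiver) {
        receiver.receive(start.rid);
        for (Resource child : start.getChildren()) {
            crawl(child, receiver);
        }
    }

    public static void main(String[] args) {
        Resource root = new Resource("/documents");
        Resource a = root.add("/documents/a");
        a.add("/documents/a/1");
        root.add("/documents/b");

        List<String> visited = new ArrayList<>();
        crawl(root, visited::add);
        // Depth-first order: /documents, /documents/a, /documents/a/1, /documents/b
        System.out.println(visited);
    }
}
```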
When the crawler works its way through a web repository, it also needs to know a start resource; it then follows all the links leading from there. The crawler cannot itself recognize the links within an HTML page and relies on the support of the repository manager to identify them.
In summary, the procedure to crawl a web site is as follows:
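A minimal sketch of this link-following procedure is shown below. A fixed link map stands in for the repository manager's link extraction, and the visited set prevents cycles; the names are illustrative, not the CM API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class WebCrawlSketch {

    // Stand-in for "the repository manager identifies the links in a page":
    // a fixed link graph instead of real HTTP fetching and HTML parsing.
    static List<String> extractLinks(Map<String, List<String>> site, String url) {
        return site.getOrDefault(url, Collections.emptyList());
    }

    // Breadth-first crawl from a start resource, skipping pages that were
    // already visited (so cycles in the link graph terminate).
    static List<String> crawl(Map<String, List<String>> site, String start) {
        List<String> result = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        visited.add(start);
        while (!queue.isEmpty()) {
            String url = queue.remove();
            result.add(url);                     // hand over for processing
            for (String link : extractLinks(site, url)) {
                if (visited.add(link)) {         // true only on first sight
                    queue.add(link);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> site = Map.of(
            "/index.html", List.of("/a.html", "/b.html"),
            "/a.html", List.of("/c.html", "/index.html")); // link back: cycle
        // Level by level: /index.html, then /a.html and /b.html, then /c.html
        System.out.println(crawl(site, "/index.html"));
    }
}
```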
Content Management has implemented three variants of the crawler:
The standard crawler is able to work through both hierarchical and web repositories; however, it is better suited to hierarchical repositories and by default it is set to crawl them. When it crawls through a repository, it uses a depth-first strategy. This means it starts at the top left node, works its way down to the bottom of the hierarchy, and then continues with the second-to-left top node.
The web crawler is able to work through both hierarchical and web repositories; however, it is better suited to web repositories and by default it is set to crawl them. When it crawls, it uses a breadth-first approach. This means it works its way through each level of links sequentially: first it follows all links on the first level, then all links on the second level, and so on.
The threaded crawler is identical to a web crawler except that it can execute retrieval and provider processes in parallel using threads. The retriever part of the crawler is responsible for getting the URLs of resources and transferring them to a list. The provider is responsible for passing the URLs on to a results receiver object that handles their processing.
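The retriever/provider split can be modeled as a standard producer-consumer setup. The following is an illustrative sketch using java.util.concurrent, not the actual threaded crawler implementation; the bounded queue corresponds loosely to the foundcapacity parameter described under Configuration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class ThreadedCrawlSketch {

    private static final String POISON = "\u0000stop"; // end-of-work marker

    // One retriever thread fills the "found entries" queue; several
    // provider threads drain it and pass entries on for processing.
    static List<String> crawl(List<String> urls, int providers)
            throws InterruptedException {
        BlockingQueue<String> foundEntries = new LinkedBlockingQueue<>(100);
        List<String> processed = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(providers + 1);

        // Retriever: finds resource URLs and puts them on the queue.
        pool.execute(() -> {
            try {
                for (String u : urls) foundEntries.put(u);
                // One poison pill per provider signals that work is done.
                for (int i = 0; i < providers; i++) foundEntries.put(POISON);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Providers: take found entries; in CM this is where the results
        // receiver object would be called for each URL.
        for (int i = 0; i < providers; i++) {
            pool.execute(() -> {
                try {
                    for (String u = foundEntries.take(); !u.equals(POISON);
                         u = foundEntries.take()) {
                        processed.add(u);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return processed;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 20; i++) urls.add("/doc" + i + ".html");
        System.out.println(crawl(urls, 3).size()); // 20
    }
}
```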
In addition to enabling and optimizing the three basic crawler types described above, the crawler service also offers a number of useful functions that influence the behavior of the crawler. For example, they can control whether the crawler remembers visited resources, whether it returns only changed resources (delta crawling), and how simultaneously running crawlers are queued.
The crawler can be instructed to remember information about each resource it has crawled. This information is useful for avoiding recrawling already visited resources and for determining whether a resource has changed since the last crawl. The list of visited resources can be kept in memory, stored in a database table, or disabled entirely (see the <ID>.visitedlist parameter in the Configuration section).
Delta Crawling
Crawlers can be used to perform a delta crawl. In this case they only return references to resources that have changed since the last crawl. A prerequisite for a delta crawl is that the list of visited resources is stored in the database. Also, an ICrawlerPushedDeltaResultReceiver must be implemented. The delta crawler can analyse different types of data to determine whether resources have changed: a CRC checksum of the content or the ETag of the resource, selected with the <ID>.checksum parameter described in the Configuration section.
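The idea behind a delta crawl can be sketched as follows. A map of resource ID to checksum stands in for the visited list that CM keeps in the database; the method names and types are illustrative, not the ICrawlerPushedDeltaResultReceiver API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DeltaCrawlSketch {

    // The "visited list": resource ID -> checksum recorded on the last
    // crawl. In CM this lives in a database table; a map stands in here.
    static List<String> deltaCrawl(Map<String, String> visitedList,
                                   Map<String, String> currentResources) {
        List<String> changed = new ArrayList<>();
        for (Map.Entry<String, String> e : currentResources.entrySet()) {
            String previous = visitedList.get(e.getKey());
            // New or modified resources are reported; unchanged ones skipped.
            if (!e.getValue().equals(previous)) {
                changed.add(e.getKey());
            }
            visitedList.put(e.getKey(), e.getValue()); // remember for next run
        }
        Collections.sort(changed); // deterministic order for the example
        return changed;
    }

    public static void main(String[] args) {
        Map<String, String> visited = new HashMap<>();
        // First crawl: everything is new.
        System.out.println(deltaCrawl(visited,
            Map.of("/a", "crc-1", "/b", "crc-2"))); // [/a, /b]
        // Second crawl: only /b has a different checksum.
        System.out.println(deltaCrawl(visited,
            Map.of("/a", "crc-1", "/b", "crc-9"))); // [/b]
    }
}
```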
Queuing
The crawler usually has to work through a large amount of distributed data, so it is often necessary to run many crawlers simultaneously. As this can result in an excessive system load, the crawlers can be registered in an ICrawlerQueue and then be started in groups at scheduled times. The crawler queue is used in combination with the scheduler service.
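The load-limiting effect of starting queued crawlers in groups can be sketched as below. A plain loop stands in for the scheduler service, and the names are illustrative, not the ICrawlerQueue API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CrawlerQueueSketch {

    // Start the registered crawlers in groups of groupSize, waiting for
    // each group to finish before starting the next, so that at most
    // groupSize crawlers run at once.
    static List<String> startInGroups(int crawlerCount, int groupSize)
            throws Exception {
        Queue<Runnable> crawlerQueue = new ArrayDeque<>();
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        for (int i = 1; i <= crawlerCount; i++) {
            final int id = i;
            crawlerQueue.add(() -> log.add("crawler-" + id)); // a "crawl"
        }
        ExecutorService pool = Executors.newFixedThreadPool(groupSize);
        while (!crawlerQueue.isEmpty()) {
            List<Future<?>> group = new ArrayList<>();
            for (int i = 0; i < groupSize && !crawlerQueue.isEmpty(); i++) {
                group.add(pool.submit(crawlerQueue.remove()));
            }
            for (Future<?> f : group) f.get(); // wait before the next group
        }
        pool.shutdown();
        return log;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(startInGroups(5, 2).size()); // 5
    }
}
```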
The crawler service offers a number of interfaces that you can use to work with crawlers. The most important are ICrawlerService, which creates and manages crawlers, ICrawler itself, and the various results receiver interfaces.
The graphic illustrates the central relationships between ICrawlerService, ICrawler and other interfaces.
The graphic illustrates the relationship between the ICrawler and the various results receiver interfaces:
If you want to use the crawler in an application, you need to obtain a crawler from the ICrawlerService and implement a results receiver that processes the collected references. The following results receivers are available:
- ICrawlerListResultReceiver object that receives a list of references to all the objects collected by the crawler. If large amounts of data are crawled, this type of results receiver can cause memory problems.
- ICrawlerPushedResultReceiver object that receives a reference for each crawled object individually. This type of results receiver is recommended when a large amount of data has to be crawled and collecting all the result objects can lead to memory problems.
- An ICrawlerPushedDeltaResultReceiver object for a delta crawl.
- A variant of the crawler, an ISpecialCrawler, that includes a results receiver. This results receiver simply offers a method to accept a list of returned results, but does not include any processing logic.
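The memory trade-off between the list and pushed styles can be illustrated with the following sketch. ListReceiver and PushedReceiver are simplified stand-ins for ICrawlerListResultReceiver and ICrawlerPushedResultReceiver, with invented method names.

```java
import java.util.ArrayList;
import java.util.List;

public class ResultsReceiverSketch {

    // Stand-ins for the two basic receiver styles (not the real CM API).
    interface ListReceiver { void receiveList(List<String> rids); }
    interface PushedReceiver { void receivePushed(String rid); }

    // List style: the crawler collects everything first, then hands over
    // one complete list; the whole result set is held in memory at once.
    static void crawlWithList(List<String> found, ListReceiver r) {
        r.receiveList(new ArrayList<>(found));
    }

    // Pushed style: each reference is handed over as soon as it is found,
    // so no complete result list has to be kept by the crawler.
    static void crawlWithPush(List<String> found, PushedReceiver r) {
        for (String rid : found) r.receivePushed(rid);
    }

    public static void main(String[] args) {
        List<String> found = List.of("/a", "/b", "/c");
        crawlWithList(found, rids -> System.out.println("list: " + rids));
        crawlWithPush(found, rid -> System.out.println("pushed: " + rid));
    }
}
```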
The behavior of the crawler is influenced by parameters that are grouped into two categories: crawler parameters and profile parameters. Crawler parameters are generally only valid for a specific crawler type, such as a threaded crawler, whereas profile parameters are valid for all crawler types. Profile parameters are stored in an IProfile object.
Before you can use a crawler, you must ensure that it is configured correctly. Keep in mind that only repositories that have been defined within Content Management can be crawled.
The parameters shown in the table can be used to configure the crawler. They can be set with the help of the user interface of the Content Management configuration framework. For more information, see the documentation Administering Content Management.
Parameters that are not required have default values that are used if no value is specified.
| Parameter | Required | Description |
| class | yes | The class that is used for this system as the crawler service, usually com.sapportals.wcm.service.crawler.wcm.CrawlerService. |
| expiration | no | The expiration time for the crawler statistics in seconds (this determines how long crawlers are kept in the list displayed in the crawler administration control). Default: 1800. |
| list | yes | A comma-separated list of IDs for the available crawlers. |
| <ID>.class | yes | The class that is to be used for the crawler with the specified ID. Possible values are com.sapportals.wcm.service.crawler.wcm.CrawlerTypeStandard, com.sapportals.wcm.service.crawler.wcm.CrawlerTypeWeb, and com.sapportals.wcm.service.crawler.wcm.CrawlerTypeThreaded. |
| <ID>.checksum | no | The class that handles the checksum calculation for the (delta) crawler with the given ID. Possible values are com.sapportals.wcm.service.crawler.wcm.CheckSumUnused (default), com.sapportals.wcm.service.crawler.wcm.CheckSumCRC, and com.sapportals.wcm.service.crawler.wcm.CheckSumETag. This parameter should be set to com.sapportals.wcm.service.crawler.wcm.CheckSumUnused if <ID>.visitedlist is not com.sapportals.wcm.service.crawler.wcm.VisitedListDatabase. |
| <ID>.visitedlist | no | The class that is to be used to store the visited entries for the crawler with the given ID. Possible values are com.sapportals.wcm.service.crawler.wcm.VisitedListUnused, com.sapportals.wcm.service.crawler.wcm.VisitedListMemory (default), and com.sapportals.wcm.service.crawler.wcm.VisitedListDatabase. |
| <ID>.poolid | no | The ID of the connection to use for accessing the database. Default: dbcon_rep. Only used if <ID>.visitedlist = com.sapportals.wcm.service.crawler.wcm.VisitedListDatabase. |
| <ID>.tablename | no | The name of the database table in which to store the visited entries. Default: wcm_crawler. Only used if <ID>.visitedlist = com.sapportals.wcm.service.crawler.wcm.VisitedListDatabase. |
| <ID>.foundcapacity | no | The limit for the number of entries in the queue (-1 for unlimited). Default: -1. |
| <ID>.retrievers | no | The number of threads that retrieve resources and put them as found entries into the queue. Default: 8. Only used if <ID>.class = com.sapportals.wcm.service.crawler.wcm.CrawlerTypeThreaded. |
| <ID>.providers | no | The number of threads that process found entries. Default: 8. Only used if <ID>.class = com.sapportals.wcm.service.crawler.wcm.CrawlerTypeThreaded. |
The following is a sample configuration entry:
service.crawler.class = com.sapportals.wcm.service.crawler.wcm.CrawlerService
service.crawler.expiration = 900
service.crawler.list = standard, web, threaded
service.crawler.standard.class = com.sapportals.wcm.service.crawler.wcm.CrawlerTypeStandard
service.crawler.standard.checksum = com.sapportals.wcm.service.crawler.wcm.CheckSumUnused
service.crawler.standard.visitedlist = com.sapportals.wcm.service.crawler.wcm.VisitedListMemory
service.crawler.standard.foundcapacity = -1
service.crawler.web.class = com.sapportals.wcm.service.crawler.wcm.CrawlerTypeWeb
service.crawler.web.checksum = com.sapportals.wcm.service.crawler.wcm.CheckSumETag
service.crawler.web.visitedlist = com.sapportals.wcm.service.crawler.wcm.VisitedListDatabase
service.crawler.web.poolid = dbcon_webcrawl
service.crawler.web.foundcapacity = -1
service.crawler.threaded.class = com.sapportals.wcm.service.crawler.wcm.CrawlerTypeThreaded
service.crawler.threaded.checksum = com.sapportals.wcm.service.crawler.wcm.CheckSumCRC
service.crawler.threaded.visitedlist = com.sapportals.wcm.service.crawler.wcm.VisitedListDatabase
service.crawler.threaded.poolid = dbcon_webcrawl
service.crawler.threaded.foundcapacity = -1
service.crawler.threaded.retrievers = 10
service.crawler.threaded.providers = 3
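The key scheme of this configuration (a list of crawler IDs, plus per-ID keys with defaults) can be read with standard java.util.Properties, as sketched below. How NetWeaver actually loads these values internally is not shown in this document; this sketch only illustrates the key structure.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

public class CrawlerConfigSketch {

    // Resolve the per-crawler retriever counts from the ID list, applying
    // the documented default of 8 when the key is absent.
    static Map<String, String> retrieverCounts(String config) throws IOException {
        Properties p = new Properties();
        p.load(new StringReader(config));
        Map<String, String> counts = new LinkedHashMap<>();
        for (String id : p.getProperty("service.crawler.list").split("\\s*,\\s*")) {
            counts.put(id,
                p.getProperty("service.crawler." + id + ".retrievers", "8"));
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // A fragment of the sample configuration above, in properties syntax.
        String config =
            "service.crawler.list = standard, web, threaded\n" +
            "service.crawler.threaded.retrievers = 10\n";
        System.out.println(retrieverCounts(config));
        // {standard=8, web=8, threaded=10}
    }
}
```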
Note that the crawler is only able to work through static pages. It cannot, for example, work through pages whose content is generated by JavaScript. It also does not yet recognize the entries in a robots.txt file assigned to an HTML file.
The sequence diagram shows the process and objects involved when a client uses a crawler:
See also: com.sapportals.wcm.service.scheduler