SAP NetWeaver '04

Package com.sapportals.wcm.util.html

Contains classes that handle the parsing of HTML.


Interface Summary
IHTMLContentHandler IHTMLContentHandler receives events from an IHTMLReader.
IHTMLElement Represents an HTML tag for an event.
IHTMLElementStart Extends IHTMLElement to handle attributes.
IHTMLFilter Processes HTML events from a parent reader.
IHTMLReader Reads HTML documents and generates events.
 

Class Summary
HTMLFilterImpl Default Implementation of IHTMLFilter.
HTMLInputStream An InputStream on top of an IHTMLReader.
HTMLReaderFactory HTMLReaderFactory creates instances of IHTMLReader.
HTMLScriptRemover Removes script content and noscript tags.
HTMLStreamWriter Writes events from an IHTMLReader onto a stream.
HtmlTag Gives structured access to the string content of a TAG token. Copyright (c) SAP AG 2001-2002
HtmlTokenizer Pull-style tokenizer for HTML documents. Copyright (c) SAP AG 2001-2003
 

Exception Summary
HTMLException HTMLException is the base class for all exceptions in this package.
 

Package com.sapportals.wcm.util.html Description

Contains classes that handle the parsing of HTML.

Package Specification

The package offers two styles of HTML parsing: push and pull.

Pull Parsing

HtmlTokenizer and HtmlTag implement a "pull"-style parsing of HTML documents.

The client of HtmlTokenizer calls next() repeatedly until the end of the document is reached. The tokenizer returns the type of the next parsed token along with its string content. A client can then use HtmlTag to access the string content of a TAG token in a structured way.
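
The pull loop described above can be sketched as follows. The signatures of HtmlTokenizer and HtmlTag are not shown on this page, so the sketch does not use them: SketchTokenizer, its TokenType enum, and the content() accessor are hypothetical stand-ins that only illustrate the calling pattern.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a pull tokenizer; NOT the real HtmlTokenizer API.
final class SketchTokenizer {
    enum TokenType { TAG, TEXT, EOF }

    private final String input;
    private int pos = 0;
    private String content = "";

    SketchTokenizer(String input) {
        this.input = input;
    }

    // Advances to the next token and returns its type.
    TokenType next() {
        if (pos >= input.length()) {
            content = "";
            return TokenType.EOF;
        }
        if (input.charAt(pos) == '<') {
            int end = input.indexOf('>', pos);
            if (end >= 0) {
                content = input.substring(pos, end + 1);
                pos = end + 1;
                return TokenType.TAG;
            }
            // An unterminated "<..." does not look like a tag: report it as text.
            content = input.substring(pos);
            pos = input.length();
            return TokenType.TEXT;
        }
        int lt = input.indexOf('<', pos);
        int end = (lt < 0) ? input.length() : lt;
        content = input.substring(pos, end);
        pos = end;
        return TokenType.TEXT;
    }

    // String content of the token returned by the last call to next().
    String content() {
        return content;
    }

    // Client-side pull loop: the caller decides when to ask for the next token.
    static List<String> tokenize(String html) {
        SketchTokenizer t = new SketchTokenizer(html);
        List<String> out = new ArrayList<>();
        for (TokenType type = t.next(); type != TokenType.EOF; type = t.next()) {
            out.add(type + ":" + t.content());
        }
        return out;
    }
}
```

The point of the pattern is that control stays with the client: it can stop calling next() at any time, for example after the first tag of interest.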

Push Parsing

IHTMLReader and IHTMLContentHandler are the basic interfaces for "push"-style parsing of HTML documents.

IHTMLReader closely follows the SAX API approach. A content handler is installed in a reader and receives an event for every parsed part of the document. A client of IHTMLReader invokes parse() on the reader, whereupon the complete document is read. While this happens, all events are sent to the installed content handler.

As a mixture between "push" and "pull", IHTMLReader also offers a way of "controlled-push" parsing. The client can invoke parseNextEvent(), whereupon the reader sends one event to the content handler and then returns to the client.
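
The difference between parse() and parseNextEvent() can be illustrated with a small self-contained sketch. Only the two method names come from the text above. SketchHandler, SketchReader, RecordingHandler, the callback names, and the boolean return value of parseNextEvent() are assumptions made for illustration; the real IHTMLReader and IHTMLContentHandler signatures may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical SAX-like callback interface; the real IHTMLContentHandler may differ.
interface SketchHandler {
    void startElement(String name);
    void endElement(String name);
    void characters(String text);
}

// Hypothetical reader; only the method names parse() and parseNextEvent()
// are taken from the package description.
final class SketchReader {
    private final String input;
    private final SketchHandler handler;
    private int pos = 0;

    SketchReader(String input, SketchHandler handler) {
        this.input = input;
        this.handler = handler;
    }

    // Controlled push: sends one event, then returns to the caller.
    // Returns false once the input is exhausted (an assumption of this sketch).
    boolean parseNextEvent() {
        if (pos >= input.length()) {
            return false;
        }
        int end = (input.charAt(pos) == '<') ? input.indexOf('>', pos) : -1;
        if (end >= 0) {
            String body = input.substring(pos + 1, end).trim();
            pos = end + 1;
            if (body.startsWith("/")) {
                handler.endElement(body.substring(1).trim());
            } else {
                // Attributes are dropped here to keep the sketch short.
                handler.startElement(body.split("\\s+")[0]);
            }
        } else {
            // Anything that does not look like a tag is reported as text.
            int lt = input.indexOf('<', pos + 1);
            int stop = (lt < 0) ? input.length() : lt;
            handler.characters(input.substring(pos, stop));
            pos = stop;
        }
        return true;
    }

    // Plain push: reads the complete document, sending every event to the handler.
    void parse() {
        while (parseNextEvent()) {
            // keep going until the end of the input
        }
    }
}

// Handler that records the events it receives.
final class RecordingHandler implements SketchHandler {
    final List<String> events = new ArrayList<>();

    public void startElement(String name) { events.add("start:" + name); }
    public void endElement(String name) { events.add("end:" + name); }
    public void characters(String text) { events.add("text:" + text); }
}
```

With parse() the handler sees the whole document before the client regains control; with parseNextEvent() the client can inspect state or stop between any two events.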

Character Encoding

Both push and pull parsers can detect the character encoding of the given HTML document. Both rely on the document's <meta> tag to do so.
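
As an illustration of <meta>-based detection in general, the following sketch looks for a charset declaration in the first bytes of a document. It is not the algorithm used by the parsers in this package, which is not described here; MetaCharsetSniffer and its detect() method are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; NOT the detection logic of the package's parsers.
final class MetaCharsetSniffer {
    // Matches charset=... inside a <meta ...> tag, e.g.
    //   <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    //   <meta charset="utf-8">
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\b[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the declared charset name, if the given bytes contain one.
    static Optional<String> detect(byte[] head) {
        // ISO-8859-1 maps every byte to exactly one char, so this decoding
        // never fails and keeps the ASCII markup intact.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

A caller would typically pass only the first few kilobytes of the document and fall back to a default encoding when the result is empty.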

Filtering

IHTMLFilter is a general filter interface for the push parser. Filters can be chained and appear as an IHTMLReader to the client. Each filter installs itself as the content handler in its parent IHTMLReader.
There is a default implementation in HTMLFilterImpl which implements the identity function, i.e. all events are forwarded unchanged.
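
The chaining idea can be sketched as follows: an identity filter forwards every event, and a subclass overrides only the callbacks it wants to change. EventSink, IdentityFilter, ScriptDroppingFilter, and CollectingSink are hypothetical names; the sketch does not reproduce the IHTMLFilter or HTMLFilterImpl signatures, and it models only the handler side of a filter, not the reader side.

```java
// Hypothetical event interface; stands in for a content handler.
interface EventSink {
    void startElement(String name);
    void endElement(String name);
    void characters(String text);
}

// Identity filter: forwards all events unchanged to the next sink in the chain.
class IdentityFilter implements EventSink {
    private final EventSink downstream;

    IdentityFilter(EventSink downstream) {
        this.downstream = downstream;
    }

    public void startElement(String name) { downstream.startElement(name); }
    public void endElement(String name) { downstream.endElement(name); }
    public void characters(String text) { downstream.characters(text); }
}

// Example filter: suppresses <script> elements and everything inside them.
final class ScriptDroppingFilter extends IdentityFilter {
    private int depth = 0; // > 0 while inside a script element

    ScriptDroppingFilter(EventSink downstream) {
        super(downstream);
    }

    @Override
    public void startElement(String name) {
        if (name.equalsIgnoreCase("script")) {
            depth++;
        } else if (depth == 0) {
            super.startElement(name);
        }
    }

    @Override
    public void endElement(String name) {
        if (name.equalsIgnoreCase("script")) {
            if (depth > 0) {
                depth--;
            }
        } else if (depth == 0) {
            super.endElement(name);
        }
    }

    @Override
    public void characters(String text) {
        if (depth == 0) {
            super.characters(text);
        }
    }
}

// Terminal sink that serializes the events it receives into a string.
final class CollectingSink implements EventSink {
    final StringBuilder out = new StringBuilder();

    public void startElement(String name) { out.append('<').append(name).append('>'); }
    public void endElement(String name) { out.append("</").append(name).append('>'); }
    public void characters(String text) { out.append(text); }
}
```

Starting from an identity implementation keeps each concrete filter small, because it only has to state where it deviates from plain forwarding.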

Output

The output of a filter/reader chain can be sent to an OutputStream or Writer by using the HTMLStreamWriter. Likewise, the output of a filter/reader chain can be read as an InputStream by using the HTMLInputStream.
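
Serializing events to a Writer can be pictured with the sketch below. WriterSerializer and its callbacks are hypothetical and much simpler than whatever HTMLStreamWriter does; the sketch ignores attributes and escaping entirely.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical event-to-text serializer; NOT the HTMLStreamWriter API.
final class WriterSerializer {
    private final Writer out;

    WriterSerializer(Writer out) {
        this.out = out;
    }

    void startElement(String name) { write("<" + name + ">"); }
    void endElement(String name) { write("</" + name + ">"); }
    void characters(String text) { write(text); }

    // Wraps the checked IOException so the callbacks stay exception-free.
    private void write(String s) {
        try {
            out.write(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Small demonstration: feeds three events and returns the serialized text.
    static String demo() {
        StringWriter w = new StringWriter();
        WriterSerializer s = new WriterSerializer(w);
        s.startElement("p");
        s.characters("Hi");
        s.endElement("p");
        return w.toString();
    }
}
```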

XHTML

It is possible to parse XHTML or even plain XML documents with the parsers in this package. By default, both parsers make no attempt to validate the document or enforce any structure (not even that the first tag is <html>). The basic working assumption for the parsers is: "report anything which does not look like a tag as text token/event."

Neither parser interprets namespace declarations (they are reported as ordinary attributes on the tag/element) or namespace prefixes. An IHTMLReader element has only a name, and any prefix is part of that name. As a consequence, XHTML documents that use a non-empty namespace prefix for the XHTML namespace will not be handled properly by content handlers.
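
The effect of prefixed names on a content handler can be shown with plain string handling. PrefixedNames and its methods are hypothetical client-side helpers, not part of this package. Stripping the prefix, as localName() does, is only a workaround: it ignores which namespace the prefix is actually bound to.

```java
// Hypothetical client-side helpers; not part of the package.
final class PrefixedNames {
    // What a typical handler does: compare the reported name directly.
    static boolean naiveIsBody(String elementName) {
        return "body".equalsIgnoreCase(elementName);
    }

    // Workaround: drop everything up to the first colon before comparing.
    static String localName(String elementName) {
        int colon = elementName.indexOf(':');
        return (colon < 0) ? elementName : elementName.substring(colon + 1);
    }
}
```

For an element reported as "x:body", naiveIsBody() returns false, while comparing localName() against "body" succeeds.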

Copyright © 2004 by SAP AG. All Rights Reserved.