Fit Specification: Parsing

 

This portion of the Fit specification describes how Fit parses HTML documents.

 

Contents:

 

HTML Document Parsing

Other HTML

Complicated Tables

Malformed HTML

Table Parsing

Cells

Tags

Implementing Table Parsing

HTML To Text

Character Conversion

Microsoft Word

Leading and Trailing Whitespace

Adjoining Whitespace

Other HTML

Run Results

 

 

HTML Document Parsing

 

Fit parses the tables from HTML documents into a data structure.  For example, in the following table, raw HTML is shown on the left and Fit’s view of the HTML is shown on the right.  (Each cell is in brackets, each row is on its own line, and tables are separated by dashes.)

 

fat.DocumentParseFixture

 

HTML

Structure()

<table>

  <tr><td>1</td></tr>

</table>

[1]

<table>

  <tr><td>1</td>   <td>2</td></tr>

  <tr><td>3</td>   <td>4</td></tr>

</table>

[1] [2]

[3] [4]

<table>

  <tr><td>1</td>   <td>2</td></tr>

  <tr><td>3</td>   <td>4</td></tr>

</table>

<table>

  <tr><td>5</td></tr>

  <tr><td>6</td></tr>

</table>

[1] [2]

[3] [4]

----

[5]

[6]
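
The structure shown above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `parse_tables` and `structure` are hypothetical (they are not part of Fit), the input is assumed to be well-formed, and regular expressions stand in for whatever parser a real implementation uses.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL

def parse_tables(html):
    """Return a list of tables; each table is a list of rows of cell bodies."""
    tables = []
    for table_body in re.findall(r"<table\b[^>]*>(.*?)</table\s*>", html, FLAGS):
        rows = []
        for row_body in re.findall(r"<tr\b[^>]*>(.*?)</tr\s*>", table_body, FLAGS):
            # Attributes such as colspan and rowspan are matched but not used.
            rows.append(re.findall(r"<td\b[^>]*>(.*?)</td\s*>", row_body, FLAGS))
        tables.append(rows)
    return tables

def structure(html):
    """Render the parsed tables in the bracket notation used in this document."""
    return "\n----\n".join(
        "\n".join(" ".join(f"[{cell}]" for cell in row) for row in table)
        for table in parse_tables(html)
    )
```

Because each row is simply a list of the cells it contains, rows of different lengths need no special handling in this sketch.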

 

Other HTML

 

Everything but table structure and cell contents is ignored.  The ignored portions are preserved so they can be output again later, exactly as they were read in.

 

fat.DocumentParseFixture

 

 

HTML

Structure()

Output()

<HTML>

<body>Text before table...

<table>

  <tr><td>1</td></tr>

</table>

Text after table...</body>

</HTML>

[1]

<HTML>

<body>Text before table...

<table>

  <tr><td>1</td></tr>

</table>

Text after table...</body>

</HTML>

<table>

Text in table

<tr>

  Text in row

  <td>Text in cell</td>

  more row

</tr>

more table</table>

[Text in cell]

<table>

Text in table

<tr>

  Text in row

  <td>Text in cell</td>

  more row

</tr>

more table</table>

<table cellpadding="3">

  <tr attribute="yes"><td align="top">Cell</td></tr>

</table>

[Cell]

<table cellpadding="3">

  <tr attribute="yes"><td align="top">Cell</td></tr>

</table>

 

Even whitespace is preserved.

 

fat.DocumentParseFixture

 

 

HTML

Structure()

Output()

<HTML><body><table>

  <tr><td>1</td></tr>

</table></body></HTML>

[1]

<HTML><body><table>

  <tr><td>1</td></tr>

</table></body></HTML>

<HTML>

  <body>

    <table>

      <tr>

        <td>1</td>

      </tr>

    </table>

  </body>

</HTML>

[1]

<HTML>

  <body>

    <table>

      <tr>

        <td>1</td>

      </tr>

    </table>

  </body>

</HTML>
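
One way to get this round-trip behaviour is to keep every piece of the input, including the text Fit ignores, and join the pieces back together for output. The sketch below assumes that approach; `split_document`, `cell_bodies`, and `output` are hypothetical names chosen for illustration, not Fit's API.

```python
import re

# Matches only table, tr, and td tags; everything else is treated as text.
TABLE_TAG = re.compile(r"(</?(?:table|tr|td)\b[^>]*>)", re.IGNORECASE)

def split_document(html):
    """Split into alternating pieces: text at even indexes, table tags at odd."""
    return TABLE_TAG.split(html)

def cell_bodies(pieces):
    """Return the text piece that directly follows each opening <td> tag."""
    return [pieces[i + 1] for i in range(1, len(pieces), 2)
            if pieces[i].lower().startswith("<td")]

def output(pieces):
    """Reassemble the document; no piece was discarded, so it equals the input."""
    return "".join(pieces)
```

Since splitting on a captured pattern keeps both the separators and the text between them, attributes, stray text inside tables, and whitespace all survive unchanged.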

 

Complicated Tables

 

The colspan and rowspan attributes of table cells are also ignored: a spanning cell is parsed as a single cell in the row where it appears.  Jagged tables (tables with a varying number of cells in each row) are allowed:

fat.DocumentParseFixture

 

HTML

Structure()

<table>

  <tr><td>1</td></tr>

  <tr><td>2</td>   <td>3</td>   <td>4</td></tr>

  <tr><td>5</td>   <td>6</td></tr>

</table>

[1]

[2] [3] [4]

[5] [6]

<table>

  <tr><td rowspan=2>1</td>   <td>2</td>   <td>3</td></tr>

  <tr><td colspan=2>4</td>   <td>5</td></tr>

</table>

[1] [2] [3]

[4] [5]

 

Malformed HTML

 

Tables that are missing “table,” “tr,” or “td” tags generate an error.

 

fat.DocumentParseFixture

 

 

 

HTML

Structure()

Output()

Note

<table>

  <tr><td>1</td>

</table>

error

error

no ending </tr> tag

<tr><td>1</td></tr>

error

error

no <table> tag

<table>

  <td>1</td>

</table>

error

error

no <tr> tag

<table>

  <tr><td>1</tr>

</table>

error

error

no ending </td> tag

 

Tables containing unclosed “table,” “tr,” and “td” tags also generate an error.

 

fat.DocumentParseFixture

 

HTML

Structure()

<table>

  <tr><td>1</td></tr>

  <table>

  <tr><td>2</td></tr>

</table>

error

<table>

  <tr><td>1</td></tr>

  <tr>

  <tr><td>2</td></tr>

</table>

error

<table>

  <tr><td>1</td><td></tr>

</table>

error

 

However, excess closing tags don’t generate an error.

 

fat.DocumentParseFixture

 

HTML

Structure()

<table>

  <tr><td>1</td></td></tr>

  <tr><td>2</td></tr>

  </tr>

</table>

</table>

[1]

[2]

 

HTML mistakes that aren’t related to tables are ignored.

 

fat.DocumentParseFixture

 

HTML

Structure()

<table>

  <tr><badTag...<td>1</td></tr>

</table>

[1]
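
The error rules in this section can be approximated by tracking how deeply the parser is nested in table, row, and cell tags. The following is a sketch under assumptions, not Fit's actual algorithm: `check_tables` and `MalformedTable` are hypothetical names, and tables nested inside cells (which this section does not discuss) are not handled.

```python
import re

TAG = re.compile(r"<(/?)(table|tr|td)\b[^>]*>", re.IGNORECASE)
LEVEL = {"table": 0, "tr": 1, "td": 2}

class MalformedTable(Exception):
    """Raised when a table, tr, or td tag is missing or left unclosed."""

def check_tables(html):
    depth = 0  # 0 = outside tables, 1 = in a table, 2 = in a row, 3 = in a cell
    for match in TAG.finditer(html):
        closing = match.group(1) == "/"
        name = match.group(2).lower()
        level = LEVEL[name]
        if not closing:
            # An opening tag is valid only directly inside its parent level.
            if depth != level:
                raise MalformedTable(f"unexpected <{name}> tag")
            depth = level + 1
        elif depth == level + 1:
            depth = level  # normal close
        elif depth > level + 1:
            raise MalformedTable(f"</{name}> found while an inner tag is still open")
        # Otherwise this is an excess closing tag, which is ignored.
    if depth != 0:
        raise MalformedTable("unclosed tag at end of document")
```

Text that is not a table, tr, or td tag never matches the pattern, so mistakes such as a broken unrelated tag pass through without affecting the check.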

 

Table Parsing

 

Fixtures (described in the fixtures specification) are given a data structure representing their table. 

 

Cells

 

Fixtures can look at the body of any individual cell in the table.

 

fat.TableParseFixture

 

 

 

HTML

Row

Column

CellBody()

<table>

  <tr><td>top left</td><td>top right</td></tr>

  <tr><td>bottom left</td><td>bottom right</td></tr>

</table>

1

1

top left

 

1

2

top right

 

2

1

bottom left

 

2

2

bottom right

 

When fixtures look at the contents of a cell, they get the full HTML markup in that cell.  (They can also ask for HTML to text conversion as described below.)

fat.TableParseFixture

 

 

 

HTML

Row

Column

CellBody()

<table>

  <tr><td>text with a <tag /></td></tr>

</table>

1

1

text with a <tag />

 

Tags

 

Fixtures can also look at the tags themselves.  Fixtures see any attributes that were in the source HTML.

 

fat.TableParseFixture

 

 

 

HTML

Row

Column

CellTag()

<table>

  <tr><td align="top">text</td></tr>

</table>

1

1

<td align="top">

 

This applies to row tags as well...

 

fat.TableParseFixture

 

 

HTML

Row

RowTag()

<table>

  <tr bgcolor="black"><td>text</td></tr>

  <tr bgcolor="white"><td>text</td></tr>

</table>

1

<tr bgcolor="black">

 

2

<tr bgcolor="white">

 

...and even to table tags.

 

fat.TableParseFixture

 

HTML

TableTag()

<table border="1">

  <tr><td>text</td></tr>

</table>

<table border="1">

 

Implementing Table Parsing

 

Fit implementations may provide functions or methods to access the data structure in any way they wish.  Services not described here (such as a method to parse out individual attributes from a tag) are not an official part of Fit at this time.  They may be added given demand; if you feel a particular service should be added, tell the Fit developers.
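
As one possible shape for such a data structure, the sketch below stores each table, row, and cell together with its opening tag. It is an illustration only: the class names, the `parse_table` function, and the 1-based `cell` accessor are assumptions made for this example and are not defined by Fit. Well-formed input is assumed.

```python
import re
from dataclasses import dataclass

FLAGS = re.IGNORECASE | re.DOTALL

@dataclass
class Cell:
    tag: str   # opening tag exactly as written, attributes included
    body: str  # raw HTML between <td ...> and </td>

@dataclass
class Row:
    tag: str
    cells: list

@dataclass
class Table:
    tag: str
    rows: list

def parse_table(html):
    """Parse the first table in html into Table, Row, and Cell objects."""
    found = re.search(r"(<table\b[^>]*>)(.*?)</table\s*>", html, FLAGS)
    if found is None:
        raise ValueError("no table found")
    rows = []
    for row_tag, row_body in re.findall(r"(<tr\b[^>]*>)(.*?)</tr\s*>", found.group(2), FLAGS):
        cells = [Cell(tag, body) for tag, body in
                 re.findall(r"(<td\b[^>]*>)(.*?)</td\s*>", row_body, FLAGS)]
        rows.append(Row(row_tag, cells))
    return Table(found.group(1), rows)

def cell(table, row, column):
    """Look up a cell using the 1-based row and column numbers of the examples."""
    return table.rows[row - 1].cells[column - 1]
```

With this layout, `cell(table, 1, 2).body` plays the role of CellBody() for row 1, column 2, and `table.rows[0].tag` plays the role of RowTag() for row 1.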

 

HTML To Text

 

Fixtures may ask Fit to convert HTML into a string.  Fit strives to “render” the HTML in the same way a browser would, so that fixtures see the same result human readers do.

 

Character Conversion

 

These specific HTML entities are converted into characters:

 

fat.HtmlToTextFixture

 

HTML | Text()
&amp; | &
(&nbsp;) | ( )
&lt; | <
&gt; | >
&quot; | "

 

The non-breaking space character is converted into a normal space.

 

fat.HtmlToTextFixture

 

HTML

Text()

(\u00a0)

( )

 

Non-ASCII characters are preserved as-is.

 

fat.HtmlToTextFixture

 

HTML

Text()

ń

ń

 

Line break tags are converted into ASCII 10 line feed characters (shown here as “\n”).

 

fat.HtmlToTextFixture

 

HTML | Text()
intentional<br>line-break | intentional\nline-break
another form<br />of line-break | another form\nof line-break
yet<br/>more<br />forms<  br   /   > | yet\nmore\nforms\n
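
The character conversions in this section can be sketched as a short pipeline. The function name `convert_characters` is hypothetical, and the order of the steps is an assumption made so that escaped text such as `&amp;lt;` is unescaped only once; whitespace trimming and other tags are outside the scope of this sketch.

```python
import re

# Only the entities listed in this section. "&amp;" is replaced last so that
# "&amp;lt;" becomes the text "&lt;" rather than "<".
ENTITIES = [("&lt;", "<"), ("&gt;", ">"), ("&quot;", '"'), ("&nbsp;", " "), ("&amp;", "&")]

# <br>, <br/>, <br /> and variants with extra spaces inside the tag.
LINE_BREAK = re.compile(r"<\s*br\s*/?\s*>", re.IGNORECASE)

def convert_characters(html):
    text = LINE_BREAK.sub("\n", html)      # line-break tags become line feeds
    for entity, char in ENTITIES:
        text = text.replace(entity, char)
    return text.replace("\u00a0", " ")     # non-breaking space becomes a space
```

Characters outside this list, including non-ASCII letters, are left untouched because nothing in the pipeline matches them.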

 

Microsoft Word

 

Fit has a few special conversion rules for HTML created by Microsoft Word.

 

“Smart quotes” are converted to regular quotes.

 

fat.HtmlToTextFixture

 

HTML | Text()
“double-quotes” | "double-quotes"
‘single quotes’ | 'single quotes'

 

Word’s use of paragraph tags for line breaks is supported.

 

fat.HtmlToTextFixture

 

HTML | Text()
<p>Line breaks</p> <p>in Word</p> | Line breaks\nin Word
<p>Another</p><p class="MsoNormal">form</p> | Another\nform
<p>Don’t think every tag that</p> <poe>starts with ‘p’ is a paragraph</poe> | Don't think every tag that starts with 'p' is a paragraph
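
The two Word rules above can be sketched as follows. This is an illustration rather than Fit's implementation: `convert_word_html` is a hypothetical name, and the sketch assumes that a line break is produced only where a closing paragraph tag is followed by an opening one.

```python
import re

SMART_QUOTES = str.maketrans({
    "\u201c": '"', "\u201d": '"',   # left and right double quotes
    "\u2018": "'", "\u2019": "'",   # left and right single quotes
})

# </p>, optional whitespace, then <p> or <p with attributes>. A tag such as
# <poe> does not match because "p" must be followed by whitespace or ">".
PARAGRAPH_BREAK = re.compile(r"</p\s*>\s*<p(?:\s[^>]*)?>", re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]*>")

def convert_word_html(html):
    text = PARAGRAPH_BREAK.sub("\n", html)  # paragraph boundary becomes a line feed
    text = ANY_TAG.sub("", text)            # all remaining tags are ignored
    return text.translate(SMART_QUOTES)
```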

 

Leading and Trailing Whitespace

 

Leading and trailing whitespace are removed.

 

fat.HtmlToTextFixture

 

HTML

Text()

     spaces         

spaces

 

 

   blank lines  

 

 

blank lines

            tabs     

tabs

 

The &nbsp; entity and non-breaking space character (represented here with “\u00a0”) are considered whitespace when removing leading and trailing whitespace.

 

fat.HtmlToTextFixture

 

HTML

Text()

a&nbsp;     

a

  a &nbsp;

a

\u00a0 a \u00a0

a

 

Leading and trailing line breaks are not removed.  (Line breaks are converted to ASCII line feed characters, as described above, and are represented here with “\n”.)

 

fat.HtmlToTextFixture

 

HTML | Text()
<br />a | \na
<p></p><p>a</p> | \na
a<br /> | a\n
<p>a</p><p></p> | a\n

 

Whitespace between a leading or trailing line break and the rest of the text is not considered leading or trailing whitespace and is not removed.  Instead, it is combined as described in the “Adjoining Whitespace” section, below.

 

fat.HtmlToTextFixture

 

HTML

Text()

   <br />   a   <br />  

\n a \n

 

Tags other than line-break tags are ignored, so whitespace on either side of a leading or trailing ignored tag counts as leading or trailing whitespace and is removed.

 

fat.HtmlToTextFixture

 

HTML

Text()

    <ignored>   a   <tags />  

a

 

Adjoining Whitespace

 

Adjoining whitespace is combined into a single space.  The &nbsp; entity and non-breaking space character are not considered whitespace when combining whitespace.  Tags other than line-break tags are ignored.

 

fat.HtmlToTextFixture

 

HTML

Text()

1   +

 

2

1 + 2

1   <tag />    2

1 2

1 &nbsp;&nbsp;&nbsp;2

1    2

1 \u00a0\u00a0\u00a02

1    2
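
The whitespace rules from this section and the previous one can be combined into a single sketch. It is written under assumptions: `normalize_whitespace` is a hypothetical name, the NUL character is assumed never to appear in the input and is used as a temporary marker, and entities other than `&nbsp;`, smart quotes, and paragraph-based line breaks are left out to keep the example focused.

```python
import re

LINE_BREAK = re.compile(r"<\s*br\s*/?\s*>", re.IGNORECASE)
OTHER_TAG = re.compile(r"<[^>]*>")
# Ordinary whitespace only. U+00A0 is deliberately excluded (Python's \s would
# match it) so that non-breaking spaces are not merged with their neighbours.
WHITESPACE_RUN = re.compile(r"[ \t\r\n\f\v]+")
BREAK_MARK = "\x00"  # placeholder for a line break; assumed absent from input

def normalize_whitespace(html):
    text = LINE_BREAK.sub(BREAK_MARK, html)   # protect line breaks from trimming
    text = OTHER_TAG.sub("", text)            # ignore every other tag
    text = WHITESPACE_RUN.sub(" ", text)      # combine adjoining whitespace
    text = text.replace("&nbsp;", " ").replace("\u00a0", " ")
    text = text.strip(" ")                    # remove leading and trailing whitespace
    return text.replace(BREAK_MARK, "\n")
```

Converting non-breaking spaces only after ordinary whitespace has been combined is what lets them count as whitespace for trimming while still being preserved one-for-one in the middle of the text.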

 

Other HTML

 

Other HTML markup is ignored.

 

fat.HtmlToTextFixture

 

HTML

Text()

<b>text</b>

text

 

  a more <i>complicated

  <spell check="true">example</spell></i>

a more complicated example

 

Run Results

 

fit.Summary