perl6-PDF-Tools
===============
## Overview
perl6-PDF-Tools is an experimental low-level tool-kit for reading and manipulating data from PDF files.

It presents a seamless view of the data in PDF or FDF documents, handling compression, encryption, fetching of indirect objects and unpacking of object streams behind the scenes. It can read, edit and create PDF files, or update them incrementally.

This module is primarily intended as a base for higher level modules. It can also be used to explore or patch data in PDF or FDF files.

It does not understand logical PDF document structure. It is, however, possible to construct simple documents and perform simple edits by direct manipulation of PDF data. You will need some knowledge of how PDF documents are structured. Please see the 'The Basics' and 'Recommended Reading' sections below.

PDF::DOM and PDF::FDF, both under construction, are intended for high-level manipulation of PDF and FDF documents.

Classes/roles in this tool-kit include:
- `PDF::Reader` - for indexed random access to PDFs
- `PDF::Storage::Filter` - a collection of standard PDF decoding and encoding tools for PDF data streams
- `PDF::Storage::IndObj` - base class for indirect objects
- `PDF::Storage::Serializer` - data marshalling utilities for the preparation of full or incremental updates
- `PDF::Storage::Crypt` - decryption / encryption (V 2 & 3 RC4 only at this stage)
- `PDF::Writer` - for the creation or update of PDFs
- `PDF::DAO` - an intermediate Data Access and Object representation layer (DAO) to PDF data structures. Base classes for PDF::DOM
## Example Usage
To create a one-page PDF that displays 'Hello, world!':
```
#!/usr/bin/env perl6
# creates t/example.pdf
use v6;
use PDF::DAO;
use PDF::DAO::Doc;
sub prefix:</>($name){ PDF::DAO.coerce(:$name) };
my @MediaBox = 0, 0, 420, 595;
my %Resources = :Procset[ /'PDF', /'Text'],
                :Font{
                    :F1{
                        :Type(/'Font'),
                        :Subtype(/'Type1'),
                        :BaseFont(/'Helvetica'),
                        :Encoding(/'MacRomanEncoding'),
                    },
                };
my $doc = PDF::DAO::Doc.new;
my $root = $doc.Root = { :Type(/'Catalog') };
my $outlines = $root<Outlines> = { :Type(/'Outlines'), :Count(0) };
my $pages = $root<Pages> = { :Type(/'Pages'), :@MediaBox, :%Resources, :Kids[], :Count(0), };
my $Contents = PDF::DAO.coerce( :stream{ :decoded("BT /F1 24 Tf 100 250 Td (Hello, world!) Tj ET" ) });
$pages<Kids>.push: { :Type(/'Page'), :Parent($pages), :$Contents };
$pages<Count>++;
my $info = $doc.Info = {};
$info.CreationDate = DateTime.now;
$info.Producer = 'PDF-Tools';
$doc.save-as: 't/example.pdf';
```
Then to update the PDF, adding another page:
```
use v6;
use PDF::DAO::Doc;
my $doc = PDF::DAO::Doc.open: 't/example.pdf';
my $catalog = $doc<Root>;
my $Parent = $catalog<Pages>;
my $Contents = PDF::DAO.coerce( :stream{ :decoded("BT /F1 16 Tf 90 250 Td (Goodbye for now!) Tj ET" ) } );
$Parent<Kids>.push: { :Type( :name<Page> ), :$Parent, :$Contents };
$Parent<Count>++;
my $info = $doc.Info //= {};
$info.ModDate = DateTime.now;
$doc.update;
```
## Description
A PDF file consists of data structures, including dictionaries (hashes), arrays, numbers and strings, plus streams
for holding data such as images, fonts and general content.
PDF files are also indexed for random access and may use filters to compress streams and to encrypt streams and strings.
They have a reasonably well specified structure. The document structure starts from the
`Root` entry in the trailer dictionary, which is the main entry point into a PDF.
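
The index used for random access is the cross-reference (`xref`) table near the end of the file. In its classic form each entry is a fixed-width line holding a 10-digit byte offset, a 5-digit generation number and a flag (`n` for in use, `f` for free). The following is a minimal sketch in plain Perl 6, independent of PDF Tools, of how one such entry could be read; the sub name `parse-xref-entry` is illustrative and not part of this module.

```raku
use v6;

# Parse one entry of a classic cross-reference table, e.g.
# "0000000014 00000 n": a 10-digit byte offset, a 5-digit generation
# number and a flag ('n' = in use, 'f' = free).
sub parse-xref-entry(Str $line) {
    if $line ~~ / ^ (\d ** 10) ' ' (\d ** 5) ' ' (<[nf]>) \s* $ / {
        return %( offset => $0.Int, gen-num => $1.Int, in-use => ~$2 eq 'n' );
    }
    die "malformed xref entry: '$line'";
}

my %entry = parse-xref-entry('0000000014 00000 n');
say "offset %entry<offset>, generation %entry<gen-num>, in use: %entry<in-use>";
# offset 14, generation 0, in use: True
```

A reader can seek straight to the byte offset of any in-use object, which is what makes random access possible without scanning the whole file.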
This module is based on the PDF Reference version 1.7 specification. It implements syntax, basic data-types, serialization and encryption rules as described in the first four chapters of the specification. Read and write access to data structures is via direct manipulation of tied arrays and hashes.
`PDF::DAO` provides a set of class builder utilities to enable higher level classes for general application development.
This is put to work in the companion module PDF::DOM (under construction), which contains a much more detailed set of classes to implement much of the remainder of the PDF specification.
## The Basics
PDF files are serialized as numbered indirect objects. The `t/example.pdf` file that we just wrote contains:
```
%PDF-1.3
%...(control characters)
1 0 obj <<
/CreationDate (D:20151225000000Z00'00')
/Producer (PDF-Tools)
>> endobj
2 0 obj <<
/Type /Catalog
/Outlines 3 0 R
/Pages 4 0 R
>> endobj
3 0 obj <<
/Type /Outlines
/Count 0
>> endobj
4 0 obj <<
/Type /Pages
/Count 1
/Kids [ 5 0 R ]
/MediaBox [ 0 0 420 595 ]
/Resources <<
/Font <<
/F1 7 0 R
>>
/Procset [ /PDF /Text ]
>>
>> endobj
5 0 obj <<
/Type /Page
/Contents 6 0 R
/Parent 4 0 R
>> endobj
6 0 obj <<
/Length 46
>> stream
BT /F1 24 Tf 100 250 Td (Hello, world!) Tj ET
endstream endobj
7 0 obj <<
/Type /Font
/Subtype /Type1
/BaseFont /Helvetica
/Encoding /MacRomanEncoding
>> endobj
xref
0 8
0000000000 65535 f
0000000014 00000 n
0000000101 00000 n
0000000172 00000 n
0000000222 00000 n
0000000400 00000 n
0000000469 00000 n
0000000567 00000 n
trailer
<<
/ID [ <4386dc7bc3489e418b44434e3a168843> <4386dc7bc3489e418b44434e3a168843> ]
/Info 1 0 R
/Root 2 0 R
/Size 8
>>
startxref
673
%%EOF
```
The PDF is composed of a series of indirect objects. For example, the first object is:
```
1 0 obj <<
/CreationDate (D:20151225000000Z00'00')
/Producer (PDF-Tools)
>> endobj
```
It's an indirect object with object number `1` and generation number `0`, with a `<<` ... `>>` delimited dictionary containing the
producer and the date that the document was created. This PDF dictionary is roughly equivalent to a Perl 6 hash:
```
{ :CreationDate("D:20151225000000Z00'00'"), :Producer("PDF-Tools"), }
```
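
To make the correspondence concrete, here is a rough sketch that renders a Perl 6 hash in PDF dictionary syntax. It is not how PDF Tools serializes data; the sub `to-pdf` and the convention that a Pair such as `name => 'Catalog'` stands for the PDF name `/Catalog` are assumptions made for this illustration, and literal strings are written without escaping.

```raku
use v6;

# Render a Perl 6 value in PDF syntax. Assumed conventions for this sketch:
# a Pair stands for a PDF name, hashes become dictionaries, arrays become
# PDF arrays, numbers are written as-is and anything else is written as a
# literal string (no escaping is attempted here).
sub to-pdf($value) {
    given $value {
        when Pair        { '/' ~ .value }
        when Associative { '<< ' ~ .sort(*.key).map({ "/{.key} {to-pdf(.value)}" }).join(' ') ~ ' >>' }
        when Positional  { '[ ' ~ .map(&to-pdf).join(' ') ~ ' ]' }
        when Numeric     { ~$_ }
        default          { "($_)" }
    }
}

my %info = :CreationDate("D:20151225000000Z00'00'"), :Producer("PDF-Tools");
say to-pdf(%info);
# << /CreationDate (D:20151225000000Z00'00') /Producer (PDF-Tools) >>
```

Keys are sorted only to make the output predictable; the order of entries in a PDF dictionary is not significant.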
The bottom of the PDF contains:
```
trailer
<<
/ID [ <4386dc7bc3489e418b44434e3a168843> <4386dc7bc3489e418b44434e3a168843> ]
/Info 1 0 R
/Root 2 0 R
/Size 8
>>
startxref
673
%%EOF
```
The `<<` ... `>>` delimited section is the trailer dictionary and the main entry point into the document. The entry `/Info 1 0 R`
is an indirect reference to the first object (object number 1, generation 0) described above.
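
An indirect reference is simply two integers followed by the keyword `R`. As a small sketch in plain Perl 6 (the sub name `parse-ind-ref` is hypothetical and not part of PDF Tools), one could be split into its object and generation numbers like this:

```raku
use v6;

# Parse an indirect reference such as "1 0 R" into its object and
# generation numbers. Returns Nil if the text is not a reference.
sub parse-ind-ref(Str $text) {
    $text ~~ / ^ \s* (\d+) \s+ (\d+) \s+ 'R' \s* $ /
        ?? %( obj-num => $0.Int, gen-num => $1.Int )
        !! Nil;
}

my %ref = parse-ind-ref('1 0 R');
say "object %ref<obj-num>, generation %ref<gen-num>";
# object 1, generation 0
```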
We can quickly put PDF Tools to work using a Perl 6 REPL, to better explore the document:
```
snoopy: ~/git/perl6-PDF-Tools $ perl6 -MPDF::DAO::Doc
> my $doc = PDF::DAO::Doc.open: "t/example.pdf"
ID => [CÜ{ÃHADCN:C CÜ{ÃHADCN:C], Info => ind-ref => [1 0], Root => ind-ref => [2 0]
> $doc.keys
(Root Info ID)
```
This is the root of the PDF, loaded from the trailer dictionary.
```
> $doc<Info>
CreationDate => D:20151225000000Z00'00', Producer => PDF-Tools;
```
That's the document information entry, commonly used to store basic meta-data about the document.
(PDF Tools has conveniently fetched indirect object 1 from the PDF, when we dereferenced this entry).
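
The idea of fetching on dereference can be sketched with a toy object table. This is only an illustration of the concept, not how PDF Tools is implemented; the `%objects` table, the `deref` sub and the modelling of a reference as the Pair `ind-ref => [obj-num, gen-num]` are all assumptions, loosely echoing the REPL output above.

```raku
use v6;

# A toy object table keyed by "obj-num gen-num". An indirect reference is
# modelled as the Pair `ind-ref => [obj-num, gen-num]` (an assumption made
# for this sketch).
my %objects =
    '1 0' => { Producer => 'PDF-Tools' },
    '2 0' => { Type => 'Catalog', Pages => (ind-ref => [4, 0]) },
    '4 0' => { Type => 'Pages', Count => 1 };

my %trailer = Info => (ind-ref => [1, 0]), Root => (ind-ref => [2, 0]);

# Return the referenced object for an indirect reference, or the value
# itself when it is stored directly.
sub deref($value) {
    return $value unless $value ~~ Pair && $value.key eq 'ind-ref';
    my ($obj-num, $gen-num) = $value.value.list;
    %objects{"$obj-num $gen-num"} // die "no such object: $obj-num $gen-num";
}

say deref(%trailer<Info>)<Producer>;               # PDF-Tools
say deref(deref(%trailer<Root>)<Pages>)<Count>;    # 1
```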
```
> $doc<Root>
Outlines => ind-ref => [3 0], Pages => ind-ref => [4 0], Type => Catalog
```
The trailer `Root` entry references the document catalog, which contains the actual PDF content. Exploring
further, the catalog potentially contains a number of pages, each with content.
```
> $doc<Root><Pages>
Count => 1, Kids => [ind-ref => [5 0]], MediaBox => [0 0 420 595], Resources => Font => F1 => ind-ref => [7 0], Type => Pages
> $doc<Root><Pages><Kids>[0]
Contents => ind-ref => [6 0], Parent => ind-ref => [4 0], Procset => [PDF Text], Type => Page
> $doc<Root><Pages><Kids>[0]<Contents>
Length => 46
> $doc<Root><Pages><Kids>[0]<Contents>.decoded
BT /F1 24 Tf 100 250 Td (Hello, world!) Tj ET
>
```
The page `/Contents` entry is a PDF stream which contains graphical instructions. In the above example, the instructions output the text `Hello, world!` at coordinates 100, 250.
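
The instructions themselves are plain text: `BT` and `ET` begin and end a text object, `Tf` selects a font resource and size, `Td` moves to a position and `Tj` shows a literal string. Below is a minimal sketch, in plain Perl 6, that assembles such a stream; the subs `pdf-literal` and `text-stream` and their parameter names are hypothetical helpers, not part of PDF Tools.

```raku
use v6;

# Escape the characters that are special inside a PDF literal string:
# the backslash and the two parentheses.
sub pdf-literal(Str $text) {
    '(' ~ $text.subst(/ <[ \\ \( \) ]> /, { '\\' ~ $/ }, :g) ~ ')';
}

# Build a content stream that shows $text using font resource $font at
# $size points, positioned at ($x, $y).
sub text-stream(Str $text, :$font = 'F1', :$size = 24, :$x = 100, :$y = 250) {
    "BT /$font $size Tf $x $y Td {pdf-literal($text)} Tj ET";
}

say text-stream('Hello, world!');
# BT /F1 24 Tf 100 250 Td (Hello, world!) Tj ET
```

The resulting string is the kind of value passed as `:decoded` when a stream object is constructed in the examples above.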
## Datatypes and Coercion
The `PDF::DAO` namespace provides roles and classes for the representation and manipulation of PDF objects.
```
use PDF::DAO::Stream;
my %dict = :Filter( :name );
my $obj-num = 123;
my $gen-num = 4;
my $decoded = "100 100 Td (Hello, world!) Tj";
my $stream-obj = PDF::DAO::Stream.new( :$obj-num, :$gen-num, :%dict, :$decoded );
say $stream-obj.encoded;
```
`PDF::DAO.coerce` is a method for the construction of objects.
It is used internally to build objects from parsed AST data, e.g.:
```
use v6;
use PDF::Grammar::Doc;
use PDF::Grammar::Doc::Actions;
use PDF::DAO;
my $actions = PDF::Grammar::Doc::Actions.new;
PDF::Grammar::Doc.parse("<< /Type /Pages /Count 1 /Kids [ 4 0 R ] >>", :rule