


NSGMLS(1)                                               NSGMLS(1)


NAME
       nsgmls - a validating SGML parser

       An SGML System Conforming to
       International Standard ISO 8879 --
       Standard Generalized Markup Language

SYNOPSIS
       nsgmls [ -deglprsuv ] [ -alinktype ] [ -ffile ] [ -iname ]
       [ -mfile ] [ filename...  ]

DESCRIPTION
       Nsgmls parses and validates the SGML  document  entity  in
       filename...   and  prints  on the standard output a simple
       text representation of its Element  Structure  Information
       Set.   (This  is  the  information  set which a structure-
       controlled conforming SGML application should  act  upon.)
       Note  that  the document entity may be spread amongst sev-
       eral files; for example, the  SGML  declaration,  document
       type  declaration  and document instance set could each be
       in a separate file.  If no filenames are  specified,  then
       nsgmls  will  read  the  document entity from the standard
       input.  Each filename is actually interpreted as a  system
       identifier.   A  command line filename of - can be used to
       refer to the standard input.  (Normally in a system  iden-
       tifier, fd:0 is used to refer to standard input.)

       The following options are available:

       -alinktype
              Make  link  type  linktype  active.   Not  all ESIS
              information is output in this case: the active LPDs
              are  not  explicitly  reported,  although each link
              attribute is qualified with  its  link  type  name;
              there is no information about result elements; when
              there are multiple link  rules  applicable  to  the
              current element, nsgmls always chooses the first.

       -d     Warn about duplicate entity declarations.

       -e     Describe  open  entities  in error messages.  Error
              messages always include the position  of  the  most
              recently opened external entity.

       -ffile Redirect  errors  to  file.   This is useful mainly
              with shells that  do  not  support  redirection  of
              stderr.

       -g     Show the GIs of open elements in error messages.

       -iname Pretend that

                     <!ENTITY % name "INCLUDE">

              occurs  at  the start of the document type declara-
              tion subset in the  SGML  document  entity.   Since
              repeated definitions of an entity are ignored, this
              definition will take precedence over any other def-
              initions of this entity in the document type decla-
              ration.  Multiple -i options are allowed.   If  the
              SGML declaration replaces the reserved name INCLUDE
              then the new reserved name will be the  replacement
              text  of  the  entity.  Typically the document type
              declaration will contain

                     <!ENTITY % name "IGNORE">

              and will use %name; in the status keyword  specifi-
              cation  of  a  marked section declaration.  In this
              case the effect of the option will be to cause  the
              marked section not to be ignored.

       -l     Output  L  commands  giving the current line number
              and filename.

       -mfile Map public identifiers and entity names  to  system
              identifiers using the catalog entry file whose sys-
              tem identifier is file.  Multiple  -m  options  are
              allowed.  Catalog entry files specified with the -m
              option will be searched before the defaults.

       -p     Parse only the  prolog.   Nsgmls  will  exit  after
              parsing the document type declaration.  Implies -s.

       -r     Warn about defaulted references.

       -s     Suppress output.   Error  messages  will  still  be
              printed.

       -u     Warn about undefined elements: elements used in the
              DTD but not defined.

       -v     Print the version number.

   External entities
       An external entity resides in one or more storage objects,
       each  of  which  contains a sequence of bytes.  The entity
       manager component of nsgmls maps  a  sequence  of  storage
       objects into an entity as follows:

       1.     The bytes in each storage object are converted into
              characters, each represented by a single bit combi-
              nation, according to the encoding translation asso-
              ciated with the storage object.

       2.     The characters in each storage object are  concate-
              nated.

       3.     The sequence of characters is treated as a sequence
              of lines each terminated by a line terminator.  The
              line terminator is either a line feed or a carriage
              return or a carriage return followed by a line
              feed.   Nsgmls  determines which line terminator to
              use for a storage object according to which of  the
              possible  line  terminators  is  used for the first
              line of the storage  object.   A  record  start  is
              inserted  at  the  beginning  of  each  line, and a
              record end at the end of each line.  If there is  a
              partial line (a line that doesn't end with the line
              terminator) at the end of the entity, then a record
              start  will be inserted before it but no record end
              will be inserted after it.
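
       The line handling in step 3 can be sketched in Python as
       shown here.  This is an illustration only, not nsgmls source
       code: the function name split_records and the marker strings
       RS and RE are assumptions made for the example, and returning
       an empty string for empty input is a guess, since the text
       does not cover that case.

```python
RS = "<RS>"  # hypothetical marker standing in for a record start
RE = "<RE>"  # hypothetical marker standing in for a record end


def split_records(text):
    """Insert RS/RE markers around each line, as step 3 describes.

    The terminator is taken from the first line of the storage
    object and then used for the whole object.
    """
    # Find the first terminator: CR LF, a lone CR, or a lone LF.
    terminator = None
    for i, ch in enumerate(text):
        if ch == "\r":
            terminator = "\r\n" if text[i + 1:i + 2] == "\n" else "\r"
            break
        if ch == "\n":
            terminator = "\n"
            break
    if terminator is None:
        # No terminator anywhere: the whole text is one partial line.
        return (RS + text) if text else ""
    lines = text.split(terminator)
    last = lines.pop()  # text after the final terminator, possibly empty
    out = [RS + line + RE for line in lines]
    if last:
        out.append(RS + last)  # partial line: record start, no record end
    return "".join(out)
```

       With this sketch, the input "a\r\nb" gives a complete first
       record and a partial second one that has a record start but
       no record end.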

       An encoding translation defines a translation between  the
       storage  coding  system and the entity coding system.  The
       storage coding system represents characters  by  sequences
       of  bytes;  it  can  be  variable width and stateful.  The
       entity coding system represents each character by a single
       bit  combination;  it is fixed-width (but not limited to 8
       bits) and  stateless.   Note  that  the  SGML  declaration
       describes the entity coding system, not the storage coding
       system.

   System identifiers
       A  system  identifier  describes  a  sequence  of  storage
       objects, each optionally associated with an encoding
       translation.  Nsgmls will attempt to interpret a system
       identifier as a keyword followed by a colon followed by a
       string, which is interpreted in a  keyword-dependent  way.
       Keywords are case-insensitive.  The following keywords are
       recognized:

       file   The string is interpreted as a filename.  The  sys-
              tem  identifier  describes  a single storage object
              that will be read from the named file.

       fd     The string is interpreted as a number.  The system
              identifier describes a single storage object that
              will be read
              from the file descriptor  with  that  number.   For
              example,  fd:0  will  read  the storage object from
              standard input.

       concat The string is treated as a list of substrings sepa-
              rated  by  + characters.  Each of the substrings is
              in turn interpreted as a system identifier, and the
              sequences  of  storage objects that each denote are
              concatenated.    The   concat   system   identifier
              describes   the   resulting   sequence  of  storage
              objects.

       http   The  string  together  with  the  http:  prefix  is
              treated  as  a URL.  This is implemented only under
              Unix.

       utf8   The string is interpreted as a system identifier.
              Each storage object that it describes that is not
              associated with an encoding translation is associated
              with an encoding translation that translates UTF8 to
              a fixed-width encoding.  Invalid multi-byte
              sequences  are represented by the character 0xFFFD.
              This keyword is recognized only in  the  multi-byte
              version of nsgmls.

       replace
              The  string  is interpreted as a system identifier.
              Numeric character references using the SGML  refer-
              ence   concrete   syntax  will  be  recognized  and
              replaced  within  each  storage  object  identifier
              occurring in the system identifier.

       ucs2   The string is interpreted as a system identifier.
              Each storage object that it describes that is not
              associated with an encoding translation is associated
              with an encoding translation that translates UCS2 to
              a fixed-width encoding.  The more significant octet
              of each character always precedes the
              less significant octet irrespective of the system's
              native byte-order.  The codes 0xFFFE and 0xFEFF are
              not  treated specially in any way.  This keyword is
              recognized  only  in  the  multi-byte  version   of
              nsgmls.

       unicode
              The string is interpreted as a system identifier.
              Each storage object that it describes that is not
              associated with an encoding translation is associated
              with an encoding translation that translates the
              Unicode coding system to a fixed-width
              encoding.  The Unicode coding  system  treats  each
              pair  of octets as a character in the system's byte
              order.  If the first character is  the  byte  order
              mark  character (0xFEFF), it will be discarded.  If
              the first character is the byte order mark  charac-
              ter  byte-swapped,  it  will  be  discarded and the
              remaining characters will  be  byte-swapped.   This
              keyword  is  recognized only in the multi-byte ver-
              sion of nsgmls.

       ujis   The string is interpreted as a system identifier.
              Each storage object that it describes that is not
              associated with an encoding translation is associated
              with an encoding translation where the storage
              coding  system  is  variable-width  (packed)   UJIS
              (EUC), and the entity coding system represents each
              character in the same way as the EUC complete  two-
              byte  format.  In the entity coding system the code
              of characters in the G0 set (usually  the  Japanese
              version of ISO 646) is unchanged; the code of
              characters in the G1 set (usually JIS X 0208-1990) is
              ORed  with 0x8080; the code of characters in the G2
              set  (usually  half-width  katakana  from   JIS   X
              0201-1986) is ORed with 0x0080; the code of charac-
              ters in the G3 set (JIS X 0212-1990) is  ORed  with
              0x8000.   This  keyword  is  recognized only in the
              multi-byte version of nsgmls.

       sjis   The string is interpreted as a system identifier.
              Each storage object that it describes that is not
              associated with an encoding translation is associated
              with an encoding translation where the storage
              coding system is Shift JIS and the entity coding
              system is the same as with the ujis encoding
              translation (except for characters in the G3 set,
              which are not representable using Shift JIS).  This
              keyword is recognized only in the multi-byte version
              of nsgmls.

       identity
              The string is interpreted as a system identifier.
              Each storage object that it describes that is not
              associated with an encoding translation is associated
              with the identity encoding translation.  The
              identity coding system converts bytes to characters
              by zero-extending each byte.

       raw    The string is interpreted as a  system  identifier.
              No  translation  of line-terminators onto RS and RE
              characters  will  be  performed  for  each  storage
              object that it describes.  Error messages referring
              to these storage objects will not contain line num-
              bers.

       huge   This  keyword  is intended for use with huge files,
              for which the cost of keeping track of line  bound-
              aries  (roughly  one  byte  per line) is too large.
              The string is interpreted as a  system  identifier.
              For  each  storage object that it describes, nsgmls
              will not keep track of where line boundaries  occur
              as  it  usually  does.  Error messages referring to
              these storage objects will not  contain  line  num-
              bers.
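
       The ucs2, unicode and identity encoding translations
       described above can be sketched in Python as shown here.
       This is an illustration, not nsgmls source code: the function
       names are hypothetical, characters are returned as a list of
       integer codes, and silently dropping a trailing odd byte is
       an assumption, since the text does not say what happens in
       that case.

```python
import sys


def decode_ucs2(data):
    """ucs2: the more significant octet always comes first.

    0xFFFE and 0xFEFF are not treated specially.
    """
    return [int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data) - 1, 2)]


def decode_unicode(data, byteorder=sys.byteorder):
    """unicode: octet pairs in the system's byte order, with BOM handling."""
    codes = [int.from_bytes(data[i:i + 2], byteorder)
             for i in range(0, len(data) - 1, 2)]
    if codes and codes[0] == 0xFEFF:
        # Byte order mark: discard it.
        return codes[1:]
    if codes and codes[0] == 0xFFFE:
        # Byte-swapped byte order mark: discard it and swap the rest.
        return [((c & 0xFF) << 8) | (c >> 8) for c in codes[1:]]
    return codes


def decode_identity(data):
    """identity: each byte is zero-extended to one character code."""
    return list(data)
```

       The byteorder parameter of decode_unicode exists only so that
       the sketch can be tried with either byte order; it defaults
       to the byte order of the machine running the code.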

       If  a system identifier does not contain a keyword or uses
       a keyword that is not recognized, then the system  identi-
       fier  will be treated as a filename.  Note that the system
       identifier file:utf8:doc.sgm  identifies  the  file  named
       utf8:doc.sgm  but  utf8:file:doc.sgm  identifies  the file
       named doc.sgm using the utf8 coding scheme.
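
       The way keywords nest in a system identifier can be sketched
       in Python as shown here.  This is only one reading of the
       text above, not the nsgmls parser: the function name
       split_sysid, the KEYWORDS and TERMINAL sets, and the decision
       to stop at file, fd, http and concat are assumptions made for
       the example.  The sketch does not split concat lists or check
       whether the multi-byte keywords are available.

```python
# Keywords listed in this section.
KEYWORDS = {
    "file", "fd", "concat", "http", "utf8", "replace", "ucs2",
    "unicode", "ujis", "sjis", "identity", "raw", "huge",
}
# Assumption: these keywords end the chain, because their string is
# not itself described as a plain system identifier.
TERMINAL = {"file", "fd", "http", "concat"}


def split_sysid(sysid):
    """Return (wrapper keywords, final keyword, remaining string)."""
    wrappers = []
    rest = sysid
    while True:
        head, sep, tail = rest.partition(":")
        key = head.lower()  # keywords are case-insensitive
        if not sep or key not in KEYWORDS:
            # No keyword, or an unrecognized one: treat as a filename.
            return wrappers, "file", rest
        if key in TERMINAL:
            return wrappers, key, tail
        wrappers.append(key)
        rest = tail
```

       Under these assumptions, split_sysid("file:utf8:doc.sgm")
       yields the filename utf8:doc.sgm with no wrappers, while
       split_sysid("utf8:file:doc.sgm") yields the filename doc.sgm
       wrapped by utf8.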

       A relative filename in a system identifier is  interpreted
       relative  to  the  file  in which the system identifier is
       specified, if any, and otherwise relative to  the  current
       directory.  This applies both to system identifiers speci-
       fied in SGML documents, and to system  identifiers  speci-
       fied in catalog entry files.

       If  a  system  identifier  does  not  specify the encoding
       translation,  the  encoding  translation  of  the  storage
       object  in  which the system identifier was specified will
       be used.

   System identifier generation
       If a system identifier is not specified, then  the  entity
       manager  will  attempt to generate one using catalog entry
       files in the format defined in the SGML Open Draft Techni-
       cal Resolution on Entity Management.  A catalog entry file
       contains a sequence of entries, each in one of the following
       forms:

       PUBLIC pubid sysid
              This  specifies  that  sysid  should be used as the
              system  identifier  if  the  public  identifier  is
              pubid.   Sysid is a system identifier as defined in
              ISO 8879  and  pubid  is  a  public  identifier  as
              defined in ISO 8879.

       ENTITY name sysid
              This  specifies  that  sysid  should be used as the
              system identifier if the entity is a general entity
              whose name is name.

       ENTITY %name sysid
              This  specifies  that  sysid  should be used as the
              system identifier if  the  entity  is  a  parameter
              entity  whose  name is name.  Note that there is no
              space between the % and the name.

       DOCTYPE name sysid
              This specifies that sysid should  be  used  as  the
              system  identifier  if  the  entity  is  an  entity
              declared in a document type declaration whose docu-
              ment type name is name.

       LINKTYPE name sysid
              This  specifies  that  sysid  should be used as the
              system  identifier  if  the  entity  is  an  entity
              declared in a link type declaration whose link type
              name is name.

       OVERRIDE
              This specifies that system identifiers specified in
              the  catalog  should  override  system  identifiers
              specified in the document.  Normally, if an  entity
              declaration  in  the  document  specifies  a system
              identifier, the catalog is not consulted.  If OVER-
              RIDE  is  specified,  then  the catalog is searched
              first; the system identifier specified in the
              document is used only if no match is found in
              the catalog.

       SGMLDECL sysid
              This specifies that if the document does  not  con-
              tain  an  SGML declaration, the SGML declaration in
              sysid should be implied.

       The last four forms are extensions to the SGML  Open  for-
       mat.   The  delimiters  can be omitted from the sysid pro-
       vided it does not contain any white space.   Comments  are
       allowed  between  parameters  delimited  by -- as in SGML.
       The environment  variable  SGML_CATALOG_FILES  contains  a
       list  of  catalog  entry  files.  The list is separated by
       colons under Unix and by semi-colons under  MSDOS.   These
       will  be  searched after any catalog entry files specified
       using the -m option.  If this environment variable is  not
       set,  then  a system dependent list of catalog entry files
       will be used.  A match in a catalog entry file for a  PUB-
       LIC  entry  will  take precedence over a match in the same
       file for an ENTITY, DOCTYPE or LINKTYPE entry.
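
       A reader for the catalog entry forms listed above can be
       sketched in Python as shown here.  This is a simplified
       illustration, not the nsgmls catalog parser: the name
       parse_catalog, the regular expression, matching keywords in
       upper case only, and skipping unknown tokens are all
       assumptions.  It does not normalize white space in public
       identifiers and does not implement lookup or precedence.

```python
import re

# Comments delimited by --, quoted literals, or bare tokens.
TOKEN = re.compile(
    r"""(?P<comment>--.*?--)|(?P<quoted>"[^"]*"|'[^']*')|(?P<bare>\S+)""",
    re.S,
)

# Number of parameters that follow each entry keyword.
ARGS = {
    "PUBLIC": 2, "ENTITY": 2, "DOCTYPE": 2,
    "LINKTYPE": 2, "OVERRIDE": 0, "SGMLDECL": 1,
}


def parse_catalog(text):
    """Return the entries as tuples: (keyword, parameter...)."""
    tokens = []
    for m in TOKEN.finditer(text):
        if m.lastgroup == "comment":
            continue
        tok = m.group()
        # Strip the delimiters from quoted literals.
        tokens.append(tok[1:-1] if m.lastgroup == "quoted" else tok)
    entries = []
    i = 0
    while i < len(tokens):
        n = ARGS.get(tokens[i])
        if n is None:
            i += 1  # assumption: skip anything that is not a keyword
            continue
        entries.append((tokens[i], *tokens[i + 1:i + 1 + n]))
        i += 1 + n
    return entries
```

       A parameter entity entry keeps its leading %, so an entry
       written as ENTITY %params "params.ent" comes back as the
       tuple ("ENTITY", "%params", "params.ent").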

   System declaration
       The system declaration for nsgmls is as follows:

                               SYSTEM "ISO 8879:1986"
                                       CHARSET
       BASESET  "ISO 646-1983//CHARSET
                 International Reference Version (IRV)//ESC 2/5 4/0"
       DESCSET  0 128 0
       CAPACITY PUBLIC  "ISO 8879:1986//CAPACITY Reference//EN"
                                      FEATURES
       MINIMIZE DATATAG NO        OMITTAG  YES     RANK     YES   SHORTTAG YES
       LINK     SIMPLE  YES 65536 IMPLICIT YES     EXPLICIT YES 1
       OTHER    CONCUR  NO        SUBDOC   YES 100 FORMAL   YES
       SCOPE    DOCUMENT
       SYNTAX   PUBLIC  "ISO 8879:1986//SYNTAX Reference//EN"
       SYNTAX   PUBLIC  "ISO 8879:1986//SYNTAX Core//EN"
                                      VALIDATE
                GENERAL YES       MODEL    YES     EXCLUDE  YES   CAPACITY NO
                NONSGML YES       SGML     YES     FORMAL   YES
                                        SDIF
                PACK    NO        UNPACK   NO

       The limit for the SUBDOC parameter is memory dependent.

       Any legal concrete syntax may be used.

   SGML declaration
       If the SGML declaration is omitted, the following declaration
       will be implied:

                             <!SGML "ISO 8879:1986"
                                     CHARSET
       BASESET  "ISO 646-1983//CHARSET
                 International Reference Version (IRV)//ESC 2/5 4/0"
       DESCSET    0  9 UNUSED
                  9  2  9
                 11  2 UNUSED
                 13  1 13
                 14 18 UNUSED
                 32 95 32
                127  1 UNUSED
       CAPACITY PUBLIC  "ISO 8879:1986//CAPACITY Reference//EN"
       SCOPE    DOCUMENT
       SYNTAX   PUBLIC  "ISO 8879:1986//SYNTAX Reference//EN"
                                    FEATURES
       MINIMIZE DATATAG NO OMITTAG  YES          RANK     NO  SHORTTAG YES
       LINK     SIMPLE  NO IMPLICIT NO           EXPLICIT NO
       OTHER    CONCUR  NO SUBDOC   YES 99999999 FORMAL   YES
                                  APPINFO NONE>
       with the exception that characters 160 through 254 will be
       assigned to DATACHAR.

       Nsgmls identifies base character sets using the  designat-
       ing sequence in the public identifier.  The following des-
       ignating sequences are recognized:
         Designating          ISO         Minimum      Number
            Escape        Registration   Character       of             Description
           Sequence          Number       Number     Characters
       -----------------------------------------------------------------------------------
       ESC 2/5 4/0             -             0              128   full set of ISO 646 IRV
       ESC 2/8 4/0              2            0              128   G0 set of ISO 646 IRV
       ESC 2/8 4/2              6            0              128   G0 set of ASCII
       ESC 2/13 4/1           100            0              128   G1 set of ISO 8859-1
       ESC 2/1 4/0              1            0               32   C0 set of ISO 646
       ESC 2/2 4/3             77            0               32   C1 set of ISO 6429
       ESC 2/5 2/15 4/0       162            0            65536   ISO 10646 UCS-2 level 1
       ESC 2/5 2/15 4/3       174            0            65536   ISO 10646 UCS-2 level 2
       ESC 2/5 2/15 4/5       176            0            65536   ISO 10646 UCS-2 level 3
       ESC 2/5 2/15 4/1       163            0       2147483648   ISO 10646 UCS-4 level 1
       ESC 2/5 2/15 4/4       175            0       2147483648   ISO 10646 UCS-4 level 2
       ESC 2/5 2/15 4/6       177            0       2147483648   ISO 10646 UCS-4 level 3

       The graphic character sets do not strictly include C0  and
       C1  control  character sets.  For convenience, nsgmls aug-
       ments the graphic character sets with the appropriate con-
       trol character sets.

   Output format
       The output is a series of lines.  Lines can be arbitrarily
       long.  Each line consists of an initial command  character
       and  one  or more arguments.  Arguments are separated by a
       single space, but when a command takes a fixed  number  of
       arguments  the last argument can contain spaces.  There is
       no space between  the  command  character  and  the  first
       argument.   Arguments  can  contain  the  following escape
       sequences.

       \\     A \.

       \n     A record end character.

       \|     Internal SDATA entities are bracketed by these.

       \nnn   The character whose code is nnn octal.

       A record start character  will  be  represented  by  \012.
       Most  applications  will need to ignore \012 and translate
       \n into newline.
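
       Decoding the escape sequences in an argument can be sketched
       in Python as shown here.  This is an illustration only: the
       name decode_arg and the sdata_bracket parameter are
       hypothetical, and the sketch assumes that octal escapes
       always have exactly three digits.  It follows the advice
       above by dropping \012 and turning \n into a newline.

```python
import re

# A backslash followed by \, n, | or three octal digits.
_ESCAPE = re.compile(r"\\(\\|n|\||[0-7]{3})")


def decode_arg(arg, sdata_bracket=""):
    """Decode the escape sequences in one output argument.

    sdata_bracket is the text substituted for each \\| marker;
    by default the markers are simply removed.
    """
    def repl(match):
        esc = match.group(1)
        if esc == "\\":
            return "\\"
        if esc == "n":
            return "\n"           # record end becomes a newline
        if esc == "|":
            return sdata_bracket  # SDATA entity bracket
        if esc == "012":
            return ""             # record start is ignored
        return chr(int(esc, 8))   # any other octal character code
    return _ESCAPE.sub(repl, arg)
```

       An application that needs to tell SDATA text apart from
       ordinary data would pass a distinctive sdata_bracket value
       rather than the empty default.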

       The possible command characters and arguments are as  fol-
       lows:

       (gi    The start of an element whose generic identifier is
              gi.  Any attributes for this element will have been
              specified with A commands.

       )gi    The end of an element whose generic identifier is gi.

       -data  Data.

       &name  A reference to an external data entity  name;  name
              will have been defined using an E command.

       ?pi    A processing instruction with data pi.

       Aname val
              The  next  element  to  start has an attribute name
              with value val which takes  one  of  the  following
              forms:

              IMPLIED
                     The value of the attribute is implied.

              CDATA data
                     The  attribute  is  character data.  This is
                     used for attributes whose declared value  is
                     CDATA.

              NOTATION nname
                     The attribute is a notation name; nname will
                     have been defined using an N command.  This
                     is  used for attributes whose declared value
                     is NOTATION.

              ENTITY name...
                     The attribute is a list  of  general  entity
                     names.   Each  entity  name  will  have been
                     defined using an I, E or S command.  This is
                     used  for attributes whose declared value is
                     ENTITY or ENTITIES.

              TOKEN token...
                     The attribute is a list of tokens.  This  is
                     used  for attributes whose declared value is
                     anything else.

       Dename name val
              This is the same as the A command, except  that  it
              specifies  a  data attribute for an external entity
              named ename.  Any D commands will come after the  E
              command  that  defines  the  entity  to  which they
              apply, but before any & or A commands  that  refer-
              ence the entity.

       atype name val
              The next element to start has a link attribute with
              link type type, name name,  and  value  val,  which
              takes the same form as with the A command.

       Nnname Define a notation nname.  This command will be
              preceded by a p command if the notation was declared
              with a public identifier, and by an s command if the
              notation was declared with a system identifier.  A
              notation will only be defined if it is to be
              referenced in an E command or in an A command for an
              attribute with a declared value of NOTATION.

       Eename typ nname
              Define an external data entity named ename with
              type typ (CDATA, NDATA or SDATA) and notation nname.
              This command will be preceded by one or more f
              commands giving the filenames generated by the entity
              manager from the system and public identifiers, by
              a p command if a public identifier was declared for
              the entity, and by an s command if a system
              identifier was declared for the entity.  nname will
              have been defined using an N command.  Data attributes
              may be specified for the entity using D commands.
              An external data entity will only be defined if it
              is to be referenced in an & command or in an A
              command for an attribute whose declared value is
              ENTITY or ENTITIES.

       Iename typ text
              Define an internal data  entity  named  ename  with
              type typ (CDATA or SDATA) and entity text text.  An
              internal data entity will only be defined if it  is
              referenced  in  an A command for an attribute whose
              declared value is ENTITY or ENTITIES.

       Sename Define a subdocument entity named ename.  This com-
              mand  will  be  preceded  by one or more f commands
              giving the filenames generated by the  entity  man-
              ager from the system and public identifiers, by a p
              command if a public identifier was declared for the
              entity, and by an s command if a system identifier
              was declared for the entity.  A subdocument  entity
              will  only  be  defined  if it is referenced in a {
              command or in an A command for an  attribute  whose
              declared value is ENTITY or ENTITIES.

       ssysid This  command applies to the next E, S or N command
              and specifies the associated system identifier.

       ppubid This command applies to the next E, S or N  command
              and specifies the associated public identifier.

       ffilename
              This command applies to the next E or S command and
              specifies an associated filename.

       {ename The start of the  SGML  subdocument  entity  ename;
              ename will have been defined using an S command.

       }ename The end of the SGML subdocument entity ename.

       Llineno file
       Llineno
              Set  the  current  line  number  and filename.  The
              filename argument will be omitted if only the  line
              number  has  changed.   This will be output only if
              the -l option has been given.

       #text  An APPINFO parameter of text was specified  in  the
              SGML declaration.  This is not strictly part of the
              ESIS, but  a  structure-controlled  application  is
              permitted  to act on it.  No # command will be out-
              put if APPINFO NONE was  specified.   A  #  command
              will  occur  at most once, and may be preceded only
              by a single L command.

       C      This command indicates that the document was a con-
              forming  SGML document.  If this command is output,
              it will be the last command.  An SGML  document  is
              not  conforming  if  it  references  a  subdocument
              entity that is not conforming.
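
       Reading this output format can be sketched in Python as shown
       here.  This is a minimal illustration, not a complete ESIS
       consumer: the name parse_esis and the event tuples are
       hypothetical, only the A, (, ), - and C commands get special
       treatment, and escape sequences in arguments are left
       undecoded.

```python
def parse_esis(lines):
    """Turn output lines into a list of event tuples.

    A commands are collected and attached to the next ( command,
    since they describe the next element to start.
    """
    pending = []  # attributes waiting for their element
    events = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        cmd, rest = line[0], line[1:]
        if cmd == "A":
            # "name val", where val is a form keyword plus optional data.
            name, val = rest.split(" ", 1)
            kind, _, value = val.partition(" ")
            pending.append((name, kind, value))
        elif cmd == "(":
            events.append(("start", rest, pending))
            pending = []
        elif cmd == ")":
            events.append(("end", rest))
        elif cmd == "-":
            events.append(("data", rest))
        elif cmd == "C":
            events.append(("conforming",))
        else:
            # Every other command is passed through unparsed.
            events.append((cmd, rest))
    return events
```

       Because C is only output for a conforming document and is
       always the last command, a caller can check whether the final
       event is ("conforming",) to learn whether the document was
       conforming.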

ENVIRONMENT
       NSGMLS_CODE
              If this is set to the name of an encoding
              translation, then that encoding translation will be used
              as the default encoding translation for  everything
              (including file input, file output, message output,
              filenames and command line  arguments).   Otherwise
              the  identity  encoding  translation  will be used.
              Setting this to ucs2 or unicode is unlikely to give
              reasonable results.

SEE ALSO
       The SGML Handbook, Charles F. Goldfarb
       ISO  8879 (Standard Generalized Markup Language), Interna-
       tional Organization for Standardization

BUGS
       Not all ESIS information for LINK is reported.

AUTHOR
       James Clark (jjc@jclark.com).
