6.3.1 The Transformation Language

The transformation process transforms an SGML document into another SGML document under the control of the transformation-specification. The SGML document that is the result of this transformation process may then be used as input to the formatting process.

In the transformation process, a user identifies portions of the SGML document that are to be mapped or transformed. For each node matching the specified portions of SGML content and structure, the transformation is accomplished according to the specification describing the new structures to be created.

All operations performed in this transformation process are independent of the later formatting process. Operations during the transformation process may include the following:

• Combining structures

SGML structures may be reordered and regrouped to create totally new structures. For example, footnotes that are inline with footnote references according to the source DTD may be collected to place the footnotes at the end of each chapter when the document is formatted.

• Creating new elements with user-specifiable relationships to other elements

New structures or attributes may be created. For example, special formatting descriptions such as the need for a 3-point rule, expressed as an SGML attribute, may be associated with every fifth row in a table to provide visual impact.

• Associating new descriptions with particular sequences of content

A sequence of elements in the source document may trigger the association of different formatting characteristics. For example, a paragraph following a warning may be required to be presented differently from all other paragraphs.

• Associating new descriptions with particular components of content

An association may be used to attach special formatting to particular strings of text that may not be specially tagged in the source document, as, for example, in the replacement of the character string ISO with the ISO logo.

DSSSL allows formatting information to be associated with, and dependent on, any combination of the above. Both the content and structure of the SGML document can be modified.
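As a rough illustration of the first operation above (collecting inline footnotes at the end of a chapter), the following Python sketch restructures a small document. It is not DSSSL: it uses the standard library XML parser as a stand-in for an SGML parser, and the element names `chapter`, `para`, `fn`, `fnref`, and `notes` are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET

# Hypothetical source: footnotes (<fn>) appear inline in the paragraphs.
SOURCE = (
    "<chapter>"
    "<para>First claim<fn>Note one.</fn> continues here.</para>"
    "<para>Second claim<fn>Note two.</fn>.</para>"
    "</chapter>"
)


def collect_footnotes(chapter):
    """Move every inline <fn> into a <notes> element at the end of the chapter."""
    notes = ET.Element("notes")
    # Snapshot the element list first, because the tree is modified in the loop.
    for parent in list(chapter.iter()):
        for child in list(parent):
            if child.tag != "fn":
                continue
            number = str(len(notes) + 1)
            position = list(parent).index(child)
            # Leave a numbered reference where the footnote was; the text
            # that followed the footnote now follows the reference.
            ref = ET.Element("fnref", {"n": number})
            ref.tail = child.tail
            child.tail = None
            parent.remove(child)
            parent.insert(position, ref)
            child.set("n", number)
            notes.append(child)
    chapter.append(notes)
    return chapter


result = collect_footnotes(ET.fromstring(SOURCE))
print(ET.tostring(result, encoding="unicode"))
```

The source structure (footnote inline with its reference) and the result structure (footnotes grouped at the end) differ, yet no formatting decision has been made; that separation is the point of the transformation process.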

The transformation language can be used to facilitate the formatting process as indicated in the examples above, or it can be used to enhance or modify documents created in accordance with a DTD that has changed over time. It may also be used to transform documents using a public DTD into a proprietary or in-house DTD.

The importance and use of the transformation language will vary depending on the SGML application, the DSSSL application, the capabilities of the formatter, and the implementation. Many formatting applications may require no transformation process at all.

6.3.1.1 Components of the Transformation Process

The component processes are:

1. Grove Building Processor

An SGML document is input to this process. The SGML document or subdocument is parsed and is represented by a collection of nodes called a grove. A grove is similar to an element tree, but may include other subtrees, for example, a subtree of attribute values. Relationships in a grove are expressed in terms of properties. For a complete description of the grove and SGML property definitions, see clause 9, Groves.

2. Transformer

The input to the transformation process includes the SGML document as created during the grove building step and the transformation-specification.

The transformation-specification consists of a collection of associations. Each association specifies the transformation of like objects in the source document into objects in the result grove. Key to this transformation is that not only can each object be mapped to an explicit location in the result grove, but it can also be mapped to a location using the result of transforming some other source object as a reference point.

The output of the transformation process is the result grove. The transformation process may take multiple SGML documents as input, and likewise may transform them into multiple SGML documents. For a complete description of the transformation process, see clause 11, Transformation Language.

3. SGML Generator

The transformation process produces a grove that must be converted to an SGML document for interchange, validation, and input to the formatting process. The SGML generator is used for this purpose. The output of the SGML generator shall be a valid SGML document. For a complete description of the SGML generator, see clause 11.4, SGML Document Generator.

The model of the transformation process is illustrated in Figure 2, The Transformation Process. Note that the shaded areas indicate the components of the DSSSL specification standardized by this International Standard.
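The three component processes can be sketched as a small pipeline. The Python below is a simplified illustration under stated assumptions, not the model defined by this International Standard: the `Node` class, the property names `gi`, `attributes`, and `content`, and the reduction of an association to a simple renaming of element types are all hypothetical. Real associations are considerably richer, and the serializer performs no escaping or validation.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class Node:
    """A grove node; data and relationships are both held as properties."""
    properties: dict = field(default_factory=dict)


def build_grove(element):
    """Step 1 (grove builder): represent a parsed element as a node."""
    content = []
    if element.text:
        content.append(element.text)
    for child in element:
        content.append(build_grove(child))
        if child.tail:
            content.append(child.tail)
    return Node({
        "gi": element.tag,
        # Attribute values form their own subtree of nodes.
        "attributes": [Node({"name": k, "value": v})
                       for k, v in element.attrib.items()],
        "content": content,
    })


def transform(node, associations):
    """Step 2 (transformer): apply the association matching each node."""
    gi = node.properties["gi"]
    return Node({
        "gi": associations.get(gi, gi),
        "attributes": node.properties["attributes"],
        "content": [item if isinstance(item, str)
                    else transform(item, associations)
                    for item in node.properties["content"]],
    })


def generate(node):
    """Step 3 (generator): serialize the result grove back to markup."""
    gi = node.properties["gi"]
    attrs = "".join(f' {a.properties["name"]}="{a.properties["value"]}"'
                    for a in node.properties["attributes"])
    body = "".join(item if isinstance(item, str) else generate(item)
                   for item in node.properties["content"])
    return f"<{gi}{attrs}>{body}</{gi}>"


source = ET.fromstring('<doc><p id="a">Hello <b>world</b>!</p></doc>')
output = generate(transform(build_grove(source), {"p": "para", "b": "emph"}))
print(output)
```

Keeping the three steps as separate functions mirrors the separation in the model: the transformer sees only groves, never the original markup or the serialized result.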

6.3.1.2 Model for Coded Characters, Characters, and Glyph Identifiers

There are three distinct components of this model:

• the coded characters in the SGML source document,

• the characters in the grove,

• the glyph identifiers of the final result document.

The characters in the SGML source document are typically encoded in accordance with a particular character encoding standard, such as ISO 8859-1 (Latin 1). The SGML declaration contains a specification of the character set either in the form of a description or in terms of codepoints in one or more particular, normally standardized or at least registered, coded character sets. It is, however, permitted to refer to a private coded character set as well as giving just a description as a minimum literal of the coded character.

There are many character coding schemes. Some of these use non-spacing characters together with a base character to represent a character with a diacritic. SGML also permits the use of entity references to represent non-keyable characters. For example, a lower case e with acute accent may be represented, in the same document, as

• a single character,

• a non-spacing diacritic and e (2 characters),

• an e and combining diacritic (2 characters),

• the entity reference &eacute;.

This variation may cause problems in searching using regular expressions.

In DSSSL, the input characters are normalized into a sequence of characters, each of which represents a specific meaning regardless of whether it was originally encoded as a single character, as multiple characters in a particular character set, or as an entity reference. Each DSSSL specification defines a single character repertoire. The character repertoire shall include all characters used in the DSSSL specification, in the source groves, and in the flow object tree; therefore, only these characters may be used. The declaration of each character also includes a set of properties that may be significant in the formatting process, for example, that the character represents a word space.
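The effect of normalization on searching can be illustrated with a short Python sketch. It is an analogy only: Unicode NFC composition and HTML entity resolution stand in here for the DSSSL normalization step, and the `normalize` function is a hypothetical helper. The sketch covers the single-character form, the base-plus-combining form, and the entity reference; an encoding that places the diacritic before the base character would need an additional reordering step that is not shown.

```python
import html
import unicodedata


def normalize(text):
    """Resolve entity references, then compose base + combining sequences."""
    return unicodedata.normalize("NFC", html.unescape(text))


variants = [
    "caf\u00e9",     # single precomposed character
    "cafe\u0301",    # e followed by a combining acute accent (2 characters)
    "caf&eacute;",   # entity reference
]

# Before normalization the three strings are all different, so a literal
# comparison or a regular expression written for one form misses the others.
assert len(set(variants)) == 3

normalized = {normalize(v) for v in variants}
print(normalized)
```

After normalization all three variants compare equal, so a single pattern matches every occurrence.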

The DSSSL specification, which may have been encoded using a different coded character set than the source document, is also translated into a sequence of characters belonging to the same repertoire as the characters used in the DSSSL trees. All comparisons, such as matching an element name, are performed by comparing these characters rather than using the coded characters of the original SGML document.

A sequence of characters in the input grove may be manipulated by a transformation process into another sequence under the control of a character-to-character map. This technique is typically used when parts of the source document contain transliterated text.
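A character-to-character map can be pictured as a simple lookup applied to each character of a sequence. The Python sketch below assumes a hypothetical three-entry map from Latin transliteration to Cyrillic; the map contents and the `apply_char_map` name are illustrative and are not taken from this International Standard.

```python
# Hypothetical character-to-character map: Latin transliteration to Cyrillic.
CHAR_MAP = str.maketrans({"m": "\u043c", "i": "\u0438", "r": "\u0440"})


def apply_char_map(text, char_map=CHAR_MAP):
    """Map each character through char_map; unmapped characters pass through."""
    return text.translate(char_map)


mapped = apply_char_map("mir")
print(mapped)
```

In practice such a map would be applied only to the parts of the document identified as transliterated text, not to the whole sequence.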

The characters in the input grove to the formatter are transformed into glyph identifiers during the formatting process. The transformation is controlled by character-to-glyph and ligature-to-glyph maps in which one or more characters are mapped into one or more glyph identifiers. The map to be used is not fixed for a document, but is expressed as a formatting characteristic that may be specified for an area or for a portion of the input grove. Ligatures are specified by mapping more than one character to a single glyph.
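The mapping from characters to glyph identifiers, including ligatures, can be sketched as a longest-match lookup. The Python below is an illustration under assumptions: the glyph identifier strings, the two example maps, and the longest-match strategy itself are hypothetical choices for this sketch, not requirements stated here.

```python
# Hypothetical maps: each key is one or more characters, each value is a
# tuple of one or more glyph identifiers.
PLAIN_MAP = {"f": ("f",), "i": ("i",), "n": ("n",)}
LIGATURE_MAP = {**PLAIN_MAP, "fi": ("fi-ligature",)}


def to_glyphs(text, glyph_map):
    """Map characters to glyph identifiers, preferring the longest match."""
    longest = max(len(key) for key in glyph_map)
    glyphs = []
    pos = 0
    while pos < len(text):
        for length in range(min(longest, len(text) - pos), 0, -1):
            ids = glyph_map.get(text[pos:pos + length])
            if ids is not None:
                glyphs.extend(ids)
                pos += length
                break
        else:
            # No entry covers this character; emit a placeholder identifier.
            glyphs.append("missing-glyph")
            pos += 1
    return glyphs


with_ligatures = to_glyphs("fin", LIGATURE_MAP)
without_ligatures = to_glyphs("fin", PLAIN_MAP)
print(with_ligatures, without_ligatures)
```

Passing the map as a parameter reflects the statement that the map is not fixed for a document: different areas or portions of the input grove can select different maps, for example one with ligatures and one without.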

Additional properties specify the font to be used. This information, together with the glyph identifier, selects an actual shape to be used in rendering. Hyphenation points are determined based on the characters, but width calculations are based on the metrics of the actual rendering shapes (i.e., based on the glyphs).