Chapter 1. Introduction to SGML

Table of Contents

1.1. SGML, Document Types, and Documents
1.2. SGML and Other Markup Systems
1.2.1. Procedural Markup Versus Declarative Markup
1.2.2. System-Specific Markup Versus Generic Markup
1.2.3. Noncontextual Markup Versus Contextual Markup
1.2.4. SGML Markup Strengths
1.3. SGML Constructs
1.3.1. Elements
1.3.2. Attributes
1.3.3. Entities
1.3.4. Comments
1.3.5. Putting the Pieces Together
1.4. SGML Document Processing

With the advent of readily available computer publishing technologies, the world is awash in electronic documents. Companies create, deliver, and store ever-increasing numbers of manuals, journals, catalogs, and memos in the course of business, and these documents are becoming easier and easier to produce and print.

At the same time, because more and more computer users recognize the difficulties that arise from growing piles of documents, publishers are now offering software tools to search through documents for the desired information.

Unfortunately, because of the way these documents are usually created and stored, some of the information that would have been most valuable for using the documents in creative ways never reaches the electronic files. Instead, the files are usually just computerized versions of a single slickly formatted arrangement of words and pictures that comes out of the typesetting process.

As a result, it is often harder for producers and consumers of electronic documents to find and use the information they want, when they want it, and in the form they want it. Most of the potential of the information in these documents ends up being wasted.

What the files are usually missing is the right kind of “information about the information”: facts about its organizational structure and its “meaning.” The Standard Generalized Markup Language, or SGML, is a technology that provides a framework for providing this extra layer of information in your files, so that you can maximize both the value of your electronic documents and your ability to manage and access them.

Computer systems can use this layer of added value in a variety of ways.

SGML is a language for recording and storing document information—a computer language that nonetheless can also be read and understood by humans. You use SGML to write rules that a group of related documents should follow when they store your desired added value. The general process of figuring out the rules is called modeling, and when you're done modeling and expressing the model in SGML form, your set of rules serves as the “language” that these documents use to “speak” to computers. In SGML terminology, such a group of documents is described by the term document type, and a rule set for a document type is called a document type definition, or DTD.

For example, newspapers might be a document type, and your SGML rules for adding information to the newspapers' electronic files would be a newspaper DTD. Such information might include, for instance, the date each feature article was assigned to be written and the name of the news organization from which each photo was obtained.

For computer systems to find the added value in your files and act on it, you must have stored that added value in the files. But before you can add the value, you need to decide exactly what value you want stored and exactly what its expression in SGML form should be. To make these decisions, you need to encapsulate the desired value in a model and build an effective and cohesive DTD from it. And to build a high-quality DTD, you need a methodology—a system of principles, procedures, and tools that you can apply to the work. In this book we provide such a methodology, along with DTD development techniques and other information about both the business and the art of DTD development.

The rest of this chapter introduces SGML concepts. Chapter 2, Introduction to DTD Development, gives some background on our DTD development methodology and some of its tools.

1.1. SGML, Document Types, and Documents

DTDs form the foundation for every SGML-based document production system by:

  • Rigorously recording and enforcing your requirements for document intelligence and structure

  • Controlling the text editors that insert and keep track of the added intelligence and structure, allowing some authoring functions to be automated

  • Controlling the systems that manage whole and partial documents

  • Providing information about the documents to the software that formats them, indexes them for retrieval, and otherwise processes them

The SGML language can be used to express any number of DTDs. Likewise, each DTD can be used as the basis for many documents of the same general type, with each document being an instance—an example—of the type described by its DTD. Any set of similar documents can be considered a document type to be modeled using SGML DTD rules: love letters, project plans, product catalogs, and so on.

Figure 1.1, “SGML, DTDs, and Document Instances” shows the relationships of SGML, DTDs, and document instances.

Figure 1.1. SGML, DTDs, and Document Instances


All documents stored on computer systems, including SGML documents, have instructions or codes embedded in the text that indicate how the text should be processed. These instructions are called markup, after the handwritten instructions on document manuscripts, indicating typefaces, margins, and other layout specifications, that were once the formal method of communication with human typesetters. SGML markup is what provides the added value to SGML document text, but it is typically used to emphasize what the text represents (that is, what the information actually is) rather than what it looks like or how it is to be processed, so that one appearance or type of processing doesn't get “locked in” to the document files.[1] Each DTD can be said to form a unique markup language.

The idea of a DTD often seems foreign to people who have only used traditional word processors and desktop publishing systems, though a DTD simply makes explicit much of what you know intuitively when you write documents. For example, if you've ever used a Word for Windows™ Version 6.0 character style for “new terms” in a document so that you don't have to keep putting those terms in boldface manually, you were applying markup (that is, marking up the document) using a construct of the same kind that a DTD would offer to authors of SGML documents.

In many ways, DTDs are closer to a database technology than a document-publishing technology in that they provide a “schema” for each “field” in a document that contains a unique kind of information that you'll want to access later. But DTDs add a twist to this picture: Some fields can contain other fields. In other words, they're hierarchical.

Since it's a DTD rather than SGML itself that defines the model for acceptable markup within documents, SGML can be considered a metalanguage, a language in which other languages are written. Thus, it may be useful to think of the “Standard Generalized Markup Language” as a document markup system or framework rather than a markup language all by itself. However, when we're talking about SGML-based documents in the abstract without concern for the DTD they are associated with, we use the phrases “SGML markup” and “SGML documents.”

Because of the DTD layer between the SGML language and an SGML document, every SGML document must have at least two parts:

  • A document type declaration containing (or pointing to one or more files that contain) the DTD rules to which the markup in this document is supposed to conform

    For example, the electronic file for a newspaper would point to the newspaper DTD in its document type declaration.

  • A document instance containing all the content and embedded markup of the document

    For example, for a newspaper, the electronically stored document instance would consist of the newspaper's actual words and other content plus its DTD-conforming markup.
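
As a rough sketch of these two parts, a newspaper's electronic file might be laid out as follows. The names “newspaper” and “newspaper.dtd” are assumptions made up for illustration, the comment is only a stand-in for real content, and the markup itself is explained in Section 1.3, “SGML Constructs”.

```sgml
<!DOCTYPE newspaper SYSTEM "newspaper.dtd">
<newspaper>
<!--the newspaper's words, other content, and DTD-conforming markup-->
</newspaper>
```

The first line is the document type declaration; everything after it is the document instance.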

So far in this discussion, we've used the term “document” as if its meaning were clear and unambiguous. Certainly, it's easy to see that a letter, a book, or a newspaper, for example, constitutes a single document as most people understand the term. However, the definition of the term document in the SGML standard broadens its scope in an interesting way. Here is the definition:

A collection of information that is processed as a unit. A document is classified as being of a particular document type.

An SGML document is any collection of information that gets processed together. So, for example, while a newspaper might be a document, a single article might also count as a document if you deliver individual articles to particular reading audiences. For another example, both cookbooks and recipes might be considered documents, depending on what you intend to do with them. Document types can model whatever level or size of document is appropriate for your needs.

There's another way in which SGML challenges the traditional understanding of a document. If you can hold a printed book in your hand and then view the same basic contents on a computer screen, how many documents do you have—one or two?

The usual view in the word processor world is that a document is made by its medium—that the same content, formatted two different ways, constitutes two documents. SGML takes the opposite view: It's the content and markup that make up an SGML document, and the precise presentation of the material (formatting characteristics, display or suppression of different pieces of content, and so on) happens apart from document creation.

The particular expression of an SGML document when it is presented to a reader is sometimes called a presentation instance, a term analogous to document instance. It might help to think of an SGML document as the source of a potentially infinite number of presentation instances—a “proto-document” as far as printing and delivery are concerned. Figure 1.2, “SGML Documents and Presentation Instances” shows the relationship between SGML documents and possible presentation instances.

Figure 1.2. SGML Documents and Presentation Instances


1.2. SGML and Other Markup Systems

It's useful to describe the characteristics of SGML markup by contrasting it with other types of markup. TeX, troff, and Script are examples of traditional typesetting markup languages that use embedded codes that are visible to authors who are working on the files. Originally, word processor software also used visible markup codes in text files, but modern word processors and desktop publishing systems incorporate viewing software that instead shows authors only the formatting effect of the markup, simulating the formatted version of the document.

We'll examine SGML and various other markup systems along the following dimensions:

  • Procedural versus declarative

  • System-specific versus generic

  • Noncontextual versus contextual

1.2.1. Procedural Markup Versus Declarative Markup

Procedural markup supplies detailed instructions for actions that software must follow in processing the data—it says, “Do x.” For example, the text of each item in a bulleted list might be preceded with markup that provides the following instructions to the software:

Output the bullet symbol “•” and tab over to 18 points from the original line length before you output the first line of text, and indent any subsequent lines of text 18 points.

The text of each item might be followed with markup providing this instruction.

Restore the original line length.

The markup would have the following effect when the text is formatted.

  • An apple is a fruit that is sweet-tasting, approximately round in shape, and red, yellow, or green in color.

  • An orange is a fruit that is sweet-tasting, approximately round in shape, and orange in color.

By contrast, declarative (or descriptive) markup supplies only high-level logical descriptions of the data's role or purpose (“I am a y”), expecting that separate processing software will map the markup to the precise actions to be performed and then perform them. For example, the text of the same bulleted list item might instead be surrounded with markup that says only:

This is the start of a bulleted list item.

and:

This is the end of a bulleted list item.

This markup puts the text in a “virtual container.” A formatting system must apply rules for using this markup appropriately to change line lengths, add bullet characters, and generally manage the output of the item.
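
In SGML form, such a virtual container might look like the following sketch. The element name “item” is hypothetical; a real DTD would choose its own names.

```sgml
<item>
An apple is a fruit that is sweet-tasting, approximately round
in shape, and red, yellow, or green in color.
</item>
```

Nothing in this markup mentions bullets, tabs, or indents; those decisions are left to whatever software formats the item.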

Traditional markup systems usually offer both a set of basic procedural requests and a way to define groupings of requests called “macros.” Desktop publishing systems and word processors have a similar capability, usually called “styles.” Such groupings of requests can be designed to be relatively declarative. For example, following is a list encoded with a hypothetical troff macro package:

.LS                                start of bulleted list
.Li                                start of list item (no end-macro)
An apple is a fruit that is
sweet-tasting, approximately
round in shape, and red, yellow,
or green in color.
.Li                                start of list item (no end-macro)
An orange is a fruit that is
sweet-tasting, approximately
round in shape, and orange in
color.
.LE                                end of bulleted list

The calls to the macros in the source file are highly declarative in that they don't mention the exact appearance the list will have. Thus, you can use different macro definitions with the same file to get different formatting results.

Procedural markup is efficient because it allows a computer to follow the supplied instructions without doing additional interpretive work. However, it binds the data closely to a single kind of manipulation, such as choosing one value for list-item indent over all other values.

Declarative markup requires an additional interpretation step, but allows document data that is stored in a single form to be formatted, analyzed, and manipulated many different ways, increasing the value of the data once it has been described thoroughly and abstractly. For example, if the “house style” changes, new interpretation rules could be substituted for the old ones to change the indents or bullet characters for list items. Of course, this flexibility is only an option wherever declarative markup has been used to the exclusion of procedural markup. Word processor users who have controlled all facets of formatting manually can't take advantage of changes in style definitions.

Purely procedural and purely declarative markup are actually two ends of a continuum. Typesetting markup languages and desktop publishing systems tend to use markup that is closer to the procedural end because they give authors a large degree of control over the formatted appearance. For example, systems like troff often come with sets of declarative macros, but allow authors to use procedural markup at any time. By contrast, SGML markup can be highly declarative because it does not “come with” any particular formatting system; if an SGML markup language is well designed and properly used, it is independent of procedures and processing and does not allow the data's value for multiple purposes to be compromised.

1.2.2. System-Specific Markup Versus Generic Markup

System-specific markup works only with particular electronic-publishing software applications that create and format the document data, and the packages are often limited to particular hardware platforms. For example, troff markup is intended to be processed by the troff-based formatting software common on UNIX™ systems and cannot be directly processed on PCs with desktop publishing software. The reverse is also true: Files produced by word processors and desktop publishing programs such as Word and WordPerfect™ are useless in a UNIX troff processing environment. To make the systems share the documents, you must convert the documents to each target system's native markup, an expensive process with a high potential for loss of information through conversion, and you must also manage other difficult problems such as file storage formats and character sets.

Generic markup, however, is independent of the characteristics of any one system. SGML is a truly generic markup system for document text because it lends itself to being shared successfully among systems:

  • It is stored in plain text files, typically ASCII, which any computer system can potentially handle.

  • It provides a formal way to record the character sets used in the document files.

  • It is the joint creation of a formal, worldwide process for building standards through consensus under the International Organization for Standardization (ISO), rather than by any one software vendor. Thus, your choices for software tools that process your existing textual document files remain open if your current system ceases to meet your needs, and you can manage the same document files on heterogeneous platforms.

  • Likewise, the design of DTDs is owned by document producers rather than by software or hardware vendors, thus reflecting the real needs of producers and protecting them from changes in the markup language that benefit vendors exclusively.

    In fact, many DTDs have already been developed for various vertical segments of the information-publishing market, such as the CALS (Continuous Acquisition and Lifecycle Support) DTDs for information produced by U.S. Department of Defense contractors, the ATA (Air Transport Association) DTDs for commercial aircraft maintenance information, and DocBook for software documentation.

Note that even if a markup language is declarative, it's not necessarily generic. For example, the style names in a Word for Windows™ template can be relatively free of references to appearance, but the styles must still be processed by Word software.

1.2.3. Noncontextual Markup Versus Contextual Markup

Most markup languages are noncontextual; that is, the data and markup occur in a stream on which particular relationships of order and hierarchy aren't explicitly imposed. For example, troff is a noncontextual markup language. In the example in Section 1.2.1, “Procedural Markup Versus Declarative Markup”, list items are shown as being “inside” a pair of macros representing the start and end of a list, which is helpful in making the markup declarative. However, nothing in the definitions of the .LS, .Li, and .LE macros specifies that a connection exists between lists and list items. Each definition just contains a series of formatting instructions, with the assumption that authors won't inappropriately put an .LE before the first .LS, or use chapter-heading macros inside pairs of list macros.

With a markup language that has the notion of context, the ordering and hierarchical containment of pieces of information, you can use context to your advantage by:

  • Making explicit rules about how the pieces should interact

  • Using computers to check the structural validity according to your rules

  • Processing and searching for text based on its context

SGML markup has a built-in notion of hierarchical containers for information that allows you to ensure, for example, that “lists” contain “items” and that they don't contain “chapters.” You can also test whether a bulleted list item is inside another bulleted list, and format it to have an introductory dash instead of a bullet and to have an additional indent level, or do the same for various levels of numbered lists so that the top level has Arabic numbers, the second level has lowercase letters, and so on. To achieve the same effect in a word processor, you must use a different style for each possible location in which a list item of each type occurs, with the result that you can end up with dozens of styles for a single logical kind of text.
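
A minimal sketch of such rules follows, using the declaration syntax introduced in Section 1.3, “SGML Constructs”. The element type names “list” and “item” are assumptions chosen for illustration.

```sgml
<!ELEMENT list - - (item+)>
<!ELEMENT item - - (#PCDATA | list)*>
```

Under these two rules, a list must contain one or more items, an item can contain text and nested lists, and a “chapter” element has no place inside either. A conforming fragment with one nested list might look like this:

```sgml
<list>
<item>Fruit
<list>
<item>Apples</item>
<item>Oranges</item>
</list>
</item>
</list>
```

Software processing the fragment can tell from context that the “Apples” item sits two lists deep and can format it accordingly.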

1.2.4. SGML Markup Strengths

To summarize, SGML markup is unique in that it combines several design strengths:

  • It is declarative, which helps document producers “write once, use many”—putting the same document data to multiple uses, such as delivery of documents in a variety of online and paper formats and interchange with others who wish to use the documents in different ways.

  • It is generic across systems and has a nonproprietary design, which helps make documents vendor and platform independent and “future-proof”—protecting them against changes in computer hardware and software.

  • It is contextual, which heightens the quality and completeness of processing by allowing documents to be structurally validated and by enabling logical collections of data to be manipulated intelligently.

The characteristics of being declarative, generic, nonproprietary, and contextual make the Standard Generalized Markup Language “standard” and “generalized.”

1.3. SGML Constructs

The grammar of every SGML markup language has four basic “parts of speech”: elements, attributes, entities, and comments. In the following sections we'll use a hypothetical set of DTD rules for a very simple “recipe” document type, along with two real recipes, to illustrate these parts of speech.

1.3.1. Elements

Section 1.2, “SGML and Other Markup Systems” described the notion of nestable containers for collections of document information. In SGML, containers are called elements. The DTD rule for the occurrence and sequence of document data and other elements inside a particular kind of element, or element type, is called a content model. Our recipe DTD creates six element types, and declares their content models as follows:

  • A “recipe” element must contain a “title” element, followed by an “ingredient list” element, followed by an “instruction list” element. All these inner elements are required.

  • A “title” element contains characters.

  • An “ingredient list” element must contain one or more “ingredient” elements.

  • An “ingredient” element contains characters.

  • An “instruction list” element must contain one or more “step” elements.

  • A “step” element contains characters.

Figure 1.3, “Recipe Elements” shows how these nested elements apply to a real document—a recipe for Hawaiian coconut pudding. The rectangles represent elements containing either the words of the recipe or smaller elements that ultimately contain words.[2]

Figure 1.3. Recipe Elements


If you scan the recipe from top to bottom, you cross the upper and lower boundaries of each rectangle in the same logical places that you would come across SGML element markup in a document instance. Each upper boundary corresponds to an element start-tag, and each lower boundary to an element end-tag. By default, SGML tag markup consists of the name of the element type surrounded by angle brackets ( < > ), with the addition of a slash ( / ) before the name in the end-tag, as follows.

<recipe>                            recipe start-tag
<title>                             title start-tag
Haupia (Coconut Pudding)
</title>                            title end-tag
<ingredient-list>                   ingredient-list start-tag
<ingredient>                        ingredient start-tag
12 ounces coconut milk
</ingredient>                       ingredient end-tag
<ingredient>
4 to 6 tablespoons sugar
</ingredient>
<ingredient>
4 to 6 tablespoons cornstarch
</ingredient>
<ingredient>
3/4 cup water
</ingredient>
</ingredient-list>                  ingredient-list end-tag
<instruction-list>                  instruction-list start-tag
<step>                              step start-tag
Pour coconut milk into saucepan.
</step>                             step end-tag
<step>
Combine sugar and cornstarch;
stir in water and blend well.
</step>
<step>
Stir sugar mixture into coconut milk;
cook and stir over low heat until thickened.
</step>
<step>
Pour into a nonstick 8-in.
square pan and chill until firm.
</step>
<step>
Cut into 2-inch squares.
</step>
</instruction-list>                 instruction-list end-tag
</recipe>                           recipe end-tag

This example barely scratches the surface of SGML content modeling possibilities. Following is a brief list of the major content model choices:

  • An element can be required (as all the recipe elements are) or optional.

  • An element can be repeatable (as the ingredient and step elements are) or nonrepeatable (as the recipe, title, ingredient list, and instruction list elements are).

  • A group of several elements can be specified so that the elements occur in a fixed order (as the elements inside the recipe element do), in an order left to the discretion of the document creator, or as mutually exclusive alternatives.

  • Like single elements, groups of elements can be specified as required or optional and as repeatable or nonrepeatable.

  • A particular element can be allowed to appear anywhere directly within another element and further down within that element's contents. Conversely, an element can be banned from appearing anywhere within another element.
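
The following declarations sketch how these choices can be written in a DTD. All of the element types shown are hypothetical and are not part of the recipe DTD, and the fragment leaves out the declarations for the inner elements it mentions. The lines delimited by <!-- and --> are comments, described in Section 1.3.4, “Comments”.

```sgml
<!--sequence (,) with an optional (?) and an optional repeatable (*) element-->
<!ELEMENT entry     - - (title, subtitle?, note*)>
<!--all three required, in any order (&)-->
<!ELEMENT nutrition - - (calories & fat & protein)>
<!--exactly one of three mutually exclusive alternatives (|)-->
<!ELEMENT source    - - (book | magazine | person)>
<!--an optional, repeatable group-->
<!ELEMENT credits   - - ((author, affiliation?)*)>
<!--inclusion: footnote allowed anywhere inside chapter, at any depth-->
<!ELEMENT chapter   - - (title, para+) +(footnote)>
<!--exclusion: footnote banned anywhere inside footnote-->
<!ELEMENT footnote  - - (para+) -(footnote)>
```

The occurrence indicators are ? (optional), + (required and repeatable), and * (optional and repeatable). The recipe DTD in Example 1.2, “Recipe DTD” needs only the comma and the plus sign.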

1.3.2. Attributes

A DTD can specify rules for special labels, called attributes, that can be attached to particular elements to further describe their content. Our sample DTD declares that the recipe and step element types have attributes as follows.

  • A “recipe” element can optionally have values for the following attributes:

    • Type of dish

      The value can be any character string (for example, “starter” or “main course”).

    • Number of servings it makes

      The value must be a number (for example, “10”).

    • Number of minutes it takes to prepare

      The value must be a number (for example, “30”).

  • A “step” element can optionally have a value indicating whether performing the step is necessary. The value can be either “yes” or “no”; if a value isn't supplied explicitly, the default value is “yes.”

Figure 1.4, “Recipe Attributes” shows attribute values assigned to these elements; they appear near the top of their respective element rectangles. The attribute values on the steps use the default.

Figure 1.4. Recipe Attributes


In an SGML document instance, attribute information, if there is any, is stored inside an element's start-tag. An attribute has two parts, a name and a value, separated by an equal sign ( = ). For example, for the pudding recipe instance, the markup would look as follows; the attributes are shown on separate lines for ease of reading, but they can all appear on the same line.

<recipe
  type="dessert"
  servings="6"
  preptime="10">        recipe start-tag with attributes
⋮
<step
  necessary="no">       step start-tag with explicitly set value
Thoroughly wash and dry (step included only to show how to set attribute)
the pot you will use.
</step>
<step>                  step start-tag with default value
⋮
</step>
</instruction-list>
</recipe>

An attribute's allowable values can be controlled through its DTD rule. Requiring the value to be a number (a series of digits) is only one possibility; another, for example, is to require the value to be a “name” (a keyword beginning with a letter and containing only letters, digits, and a few special characters).
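
As a sketch, a variation on the recipe attribute rules could add an attribute whose value must be a name, using the ATTLIST declaration syntax shown in Example 1.2, “Recipe DTD”. The “cuisine” attribute and its sample value are assumptions for illustration and do not appear in the recipe DTD.

```sgml
<!ATTLIST recipe
        type            CDATA           #IMPLIED
        servings        NUMBER          #IMPLIED
        preptime        NUMBER          #IMPLIED
        cuisine         NAME            #IMPLIED
>
```

With this rule in place, a start-tag such as <recipe cuisine="hawaiian"> would be acceptable, while cuisine="2nd course" would not, because that value begins with a digit and contains a space.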

1.3.3. Entities

A DTD can identify fragments of document content, called entities, that are stored separately from the main content of the documents they're used in. Storing such a fragment separately allows it to be used multiple times and to be updated easily through the changing of a single definition. Documents use entity references to include an entity's content everywhere it is supposed to appear.

Our sample DTD creates an entity called “pour-chill-cut” that contains all the data and all the markup for two sequential steps that are likely to occur in several recipes:

  • A step with a default attribute value, containing the string “Pour into a nonstick 8-in. square pan and chill until firm.”

  • A step with a default attribute value, containing the string “Cut into 2-inch squares.”

Figure 1.5, “Recipe Entity and Reference” shows where the original document data and markup for the two steps are replaced by the entity reference.

Figure 1.5. Recipe Entity and Reference


In SGML markup, references to entities containing data and markup appear in the document wherever that text is desired, with the name of the entity surrounded by an ampersand ( & ) and a semicolon ( ; ). For example, the “pour-chill-cut” entity reference looks as follows when placed at the end of all the other steps that are physically present in the file (remember that the entity reference includes both the words of the two steps and their <step> tag markup).

⋮
<step>
Stir sugar mixture into coconut milk;
cook and stir over low heat until thickened.
</step>
&pour-chill-cut;                   reference to entity
</instruction-list>
</recipe>
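
After the reference is replaced with the entity's content, the end of the recipe is equivalent to the following, assuming the entity text declared in Example 1.2, “Recipe DTD”.

```sgml
⋮
<step>
Stir sugar mixture into coconut milk;
cook and stir over low heat until thickened.
</step>
<step>
Pour into a nonstick 8-in.
square pan and chill until firm.
</step>
<step>
Cut into 2-inch squares.
</step>
</instruction-list>
</recipe>
```

Changing the entity's definition in one place would change these two steps in every recipe that refers to it.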

1.3.4. Comments

Any SGML document can contain comments, notes to the author or to other readers of the SGML data and markup, in nearly any location. The DTD can't control whether or where comments are used, and the logical SGML “view” of the document ignores comments entirely as if they weren't present; they appear in the SGML document instance, but they are disposed of during processing.

In SGML document instances, comments are delimited by the strings <!-- and -->. For example, the pudding recipe might contain the following comments.

<recipe
  type="dessert"
  servings="6" 
  preptime="10">
<!--I wrote down this recipe               comment
just as my grandmother told me
about it, but I have some doubts.
I need to test it in the kitchen.
-->
<title>
Haupia (Coconut Pudding)
</title>
<ingredient-list>
<ingredient>
12 ounces coconut milk
</ingredient>
<ingredient>
4 to 6 tablespoons sugar
</ingredient>
<ingredient>
4 to 6 tablespoons cornstarch<!--Is this   comment
amount correct??-->
</ingredient>
<ingredient>
3/4 cup water
</ingredient>
</ingredient-list>
⋮
</recipe>

1.3.5. Putting the Pieces Together

Figure 1.6, “Recipe Elements, Attributes, and Entity” illustrates elements, attributes, and entities for the recipe all at once. (Remember that comments are not part of the logical SGML structure of a document; this is why they aren't shown here.)

Figure 1.6. Recipe Elements, Attributes, and Entity


Example 1.1, “SGML Document for Pudding Recipe” shows the actual SGML document corresponding to the entire pudding recipe in Figure 1.6, “Recipe Elements, Attributes, and Entity”. The first line contains the document type declaration, indicated by an exclamation point ( ! ) and the DOCTYPE keyword. This is the part of the SGML document that points to the DTD rules to which this document instance conforms. In this case, the DTD rules are stored in a file on the system named recipe.dtd (shown in Example 1.2, “Recipe DTD”). The rest of the lines contain the content of the document instance.

Example 1.1. SGML Document for Pudding Recipe

<!DOCTYPE recipe SYSTEM "recipe.dtd">    pointer to DTD rules
<recipe
  type="dessert"
  servings="6" 
  preptime="10">
<!--I wrote down this recipe
just as my grandmother told me
about it, but I have some doubts.
I need to test it in the kitchen.-->
<title>
Haupia (Coconut Pudding)
</title>
<ingredient-list>
<ingredient>
12 ounces coconut milk
</ingredient>
<ingredient>
4 to 6 tablespoons sugar
</ingredient>
<ingredient>
4 to 6 tablespoons cornstarch<!--Is this
amount correct??-->
</ingredient>
<ingredient>
3/4 cup water
</ingredient>
</ingredient-list>
<instruction-list>
<step>
Pour coconut milk into saucepan.
</step>
<step>
Combine sugar and cornstarch;
stir in water and blend well.
</step>
<step>
Stir sugar mixture into coconut milk;
cook and stir over low heat until thickened.
</step>
&pour-chill-cut;
</instruction-list>
</recipe>

Example 1.2, “Recipe DTD” shows the actual recipe DTD, consisting of the SGML rules found in the recipe.dtd file. Lines in angle brackets ( < > ) and beginning with an exclamation point ( ! ) can be thought of as “statements” in the SGML metalanguage; they are called markup declarations because they specify the rules that the markup must follow. The ELEMENT keyword begins a rule for an element type, the ATTLIST keyword begins a rule for the list of attributes available on an element type, and the ENTITY keyword begins a rule for an entity. The #PCDATA keyword indicates that the element type being declared can contain data characters. (It's possible for a content model to allow #PCDATA to be mixed in with elements, but none of the element types in the recipe DTD happens to allow this configuration.)

Example 1.2. Recipe DTD

<!ELEMENT recipe           - - (title, ingredient-list,
                                instruction-list)>
<!ATTLIST recipe
        type            CDATA           #IMPLIED
        servings        NUMBER          #IMPLIED
        preptime        NUMBER          #IMPLIED
>
<!ELEMENT title            - - (#PCDATA)>
<!ELEMENT ingredient-list  - - (ingredient+)>
<!ELEMENT ingredient       - - (#PCDATA)>
<!ELEMENT instruction-list - - (step+)>
<!ELEMENT step             - - (#PCDATA)>
<!ATTLIST step
        necessary       (yes|no)        yes
>
<!ENTITY  pour-chill-cut
"<step>
Pour into a nonstick 8-in.
square pan and chill until firm.
</step>
<step>
Cut into 2-inch squares.
</step>">

Using this DTD, you can create documents with different arrangements of elements and attributes as long as they adhere to the same rules. Example 1.3, “SGML Document for Fudge Recipe” shows a second recipe conforming to the recipe DTD. It has different numbers of ingredients and steps and uses some different values for attributes, but makes use of the same “pour-chill-cut” entity containing two steps.

To make a point, we've slightly varied the appearance of the physical markup compared to that in Example 1.1, “SGML Document for Pudding Recipe”, putting most of the start-tags and end-tags on the same line as the element's content and adding a few blank lines for readability. These changes to the markup make no difference to the logical SGML view of the document, and no difference to the final output of formatted recipes. In other words, the “formatting” of the actual markup is largely irrelevant.

Example 1.3. SGML Document for Fudge Recipe

<!DOCTYPE recipe SYSTEM "recipe.dtd">
<recipe type="dessert" servings="6" preptime="15">
<title>Two-Minute Fudge</title>

<ingredient-list>
<ingredient>1 pound confectioner's sugar</ingredient>
<ingredient>1/2 cup cocoa</ingredient>
<ingredient>1/4 teaspoon salt</ingredient>
<ingredient>1/2 cup butter</ingredient>
<ingredient>1/4 cup milk</ingredient>
<ingredient>1 tablespoon vanilla</ingredient>
<ingredient>1 cup chopped nuts (optional)</ingredient>
</ingredient-list>

<instruction-list>
<step>In a 1-1/2 quart glass casserole, 
blend the first five ingredients.</step>

<step>Put butter over the top
and microwave on HIGH 2 minutes.</step>

<step>Stir until smooth and
blend in the vanilla.</step>

<step necessary="no">Add the chopped nuts.</step>
<!--Consider suggesting raisins as an alternative?-->

&pour-chill-cut;
</instruction-list>
</recipe>

Finally, Figure 1.7, “Two SGML Documents Conforming to the Recipe DTD” shows a visual comparison of the structure of the two documents, side by side.

Figure 1.7. Two SGML Documents Conforming to the Recipe DTD

1.4. SGML Document Processing

A DTD is an essential part of an SGML-based document-processing environment.[3] However, it is only a small part. To convert, create, and format documents, build databases for search and retrieval, and otherwise manipulate your SGML documents effectively, you must use software applications. You can break down the kinds of software processing you need to perform into three major categories:

  • Document creation

    This category might include, for example, writing and revising data and markup using an SGML-aware text editor or word processor; converting documents in non-SGML file formats to make them conform to a target DTD on a one-time or routine basis; and assembling whole documents from fragments.

  • Document management

    This category might include, for example, storing and archiving documents and document fragments in a database; extracting document fragments for assembly; and controlling and tracking workflow, document revisions, and access to files.

  • Document utilization

    This category might include, for example, formatting documents for printing and online viewing; indexing them for online retrieval; adding hyperlinks to them for online navigation; and interchanging them with business partners and customers in original SGML form.

Companies might have business requirements for only one or two of the three, or for only a few aspects of each one.

At the base of all SGML-aware software technology is a parser component, which reads SGML documents and recognizes the markup in them so that other software components can process the markup and data. Many products incorporate a validating parser, a special kind of parser that reads DTDs and document instances and finds any markup errors in them. Several public-domain validating parsers are available. Typically, documents are validated as they enter and exit each stage of their processing.
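To make the idea of validation concrete, here is a minimal Python sketch. It is not a real validating parser: it assumes the content models of Example 1.2, “Recipe DTD” have been translated by hand into regular expressions, and it checks only whether the sequence of an element's children is allowed. The names CONTENT_MODELS and children_conform are our own, chosen for illustration; attributes, entities, and tag omission are ignored entirely.

```python
import re

# Content models from the recipe DTD, hand-translated (an assumption of this
# sketch) into regular expressions over child names, each followed by a space.
# "#PCDATA" stands for a run of data characters.
CONTENT_MODELS = {
    "recipe": r"title ingredient-list instruction-list ",
    "title": r"(#PCDATA )?",
    "ingredient-list": r"(ingredient )+",
    "ingredient": r"(#PCDATA )?",
    "instruction-list": r"(step )+",
    "step": r"(#PCDATA )?",
}


def children_conform(element, children):
    """Return True if the child sequence is allowed by the element's model."""
    sequence = "".join(name + " " for name in children)
    return re.fullmatch(CONTENT_MODELS[element], sequence) is not None


# A recipe with its three children in the declared order.
good_recipe = children_conform(
    "recipe", ["title", "ingredient-list", "instruction-list"])

# The same children in the wrong order.
misordered_recipe = children_conform(
    "recipe", ["ingredient-list", "title", "instruction-list"])

# An ingredient list with no ingredients at all.
empty_list = children_conform("ingredient-list", [])
```

With these models, the first check succeeds and the other two fail: the recipe content model fixes the order of its three children, and ingredient+ requires at least one ingredient.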

Applications that operate on SGML documents are often said to be “event-driven” because they search through the document looking for markup events: configurations of markup conditions that have been specified by an application developer. They can then process the document content that has been located.
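The event-driven style can be sketched in a few lines of Python. The sketch below uses the standard library's html.parser module as a stand-in for an SGML parser, which is an assumption made purely for illustration: that module does not read a DTD, validate the document, or expand entities such as pour-chill-cut. The class name StepEvents and the function collect_steps are hypothetical. The handler reacts to three kinds of markup events (a step start-tag, data, and a step end-tag) and collects the content of each step.

```python
from html.parser import HTMLParser


class StepEvents(HTMLParser):
    """Collects each step's text by reacting to tag and data events."""

    def __init__(self):
        super().__init__()
        self.steps = []         # (necessary, text) pairs, in document order
        self._necessary = None  # attribute value of the step being read
        self._chunks = None     # data collected for the step being read

    def handle_starttag(self, tag, attrs):
        if tag == "step":
            # The DTD default for "necessary" is "yes". This stand-in parser
            # never sees the DTD, so the default is supplied by hand.
            self._necessary = dict(attrs).get("necessary", "yes")
            self._chunks = []

    def handle_data(self, data):
        # Data outside a step (such as line breaks between tags) is skipped.
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "step" and self._chunks is not None:
            # Collapse line breaks and runs of spaces in the collected data.
            text = " ".join("".join(self._chunks).split())
            self.steps.append((self._necessary, text))
            self._chunks = None


def collect_steps(document):
    handler = StepEvents()
    handler.feed(document)
    handler.close()
    return handler.steps


FUDGE_FRAGMENT = """<instruction-list>
<step>Stir until smooth and
blend in the vanilla.</step>
<step necessary="no">Add the chopped nuts.</step>
<!--Consider suggesting raisins as an alternative?-->
</instruction-list>"""

steps = collect_steps(FUDGE_FRAGMENT)
```

For this fragment of the fudge recipe, steps holds two pairs: the first step with the value yes, and the chopped-nuts step with the value no. The comment produces no entry, because the handler defines no reaction to comment events.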

A wide range of SGML-aware search capabilities is available in commercial and public-domain applications. All can detect elements, specific attribute values, and elements that appear inside certain other elements. Many can detect elements that occur in a certain position in a group. It is rarer to find applications that can detect elements whose contents include certain other elements, or any other situation that involves “lookahead” past the current point in the flow of the document. The more capable a system's SGML search capabilities, the more flexible and valuable its potential uses will be. Other factors are important as well; for example, it should be possible to locate data, save it, and output it in a different location.

Because each DTD contains a different set of markup rules and because each company has unique requirements for SGML document processing, each application typically requires customization in order to be usable.

Let's return to our recipe example to see what kinds of utilization might be possible. Once the recipes are marked up thoroughly and precisely (through the use of either an editor or a conversion program), you'll probably want to use processing software that treats them as more than just the antecedents of ink on paper. Certainly, you can format the documents consistently and professionally for a single chosen formatting style. For instance, you can use the value of the necessary attribute on step elements to generate an “Optional: ” prefix on every unnecessary step, or process each recipe's preparation time information to output it in either “n minutes” form or “n hours n minutes” form, depending on whether the preptime attribute value is over 59.
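The two formatting rules just described might be sketched as follows in Python. The function names format_preptime and format_step are hypothetical, and the sketch assumes that the attribute values have already been extracted from the document as strings. It follows the “n hours n minutes” wording literally and does not attempt to handle singular forms such as “1 hour”.

```python
def format_preptime(preptime):
    """Render a preptime attribute value (a number of minutes) for output."""
    minutes = int(preptime)
    if minutes <= 59:
        return f"{minutes} minutes"
    # Over 59 minutes: split into whole hours and the remaining minutes.
    hours, rest = divmod(minutes, 60)
    return f"{hours} hours {rest} minutes"


def format_step(text, necessary="yes"):
    """Prefix unnecessary steps with "Optional: "; leave the others as is."""
    prefix = "Optional: " if necessary == "no" else ""
    return prefix + text


pudding_time = format_preptime("10")
long_time = format_preptime("135")
nuts_step = format_step("Add the chopped nuts.", necessary="no")
```

The default of "yes" for the necessary parameter mirrors the default declared for that attribute in the recipe DTD, so a step with no explicit attribute value is formatted as a required step.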

However, you can also produce multiple styles of formatted output from the same source files, for example, regular-print, large-print, and Braille editions of a cookbook, or several cookbooks that bring together different collections of recipes, or packets of index cards with a recipe on each one. Further, you can use the identical files to build a hierarchical recipe database that users can query in interesting ways:

  • I don't have much time to cook. Which dessert recipes take less than 30 minutes to prepare and have fewer than eight required steps?

  • For my dinner party, what is the total shopping list of items necessary to make spinach salad, steak, and chocolate mousse?

  • There's not much food left in the house. Which vegetable dishes require no more than pickles, canned corn, and baking soda? (It is to be hoped that this query won't return any hits!)
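The first of these queries can be sketched in Python, assuming the marked-up recipes have already been loaded into simple records. The record layout and the name quick_desserts are our own inventions, not part of any SGML tool. The first two records reflect the pudding and fudge documents shown earlier, with the two steps from the pour-chill-cut entity counted; the third record is hypothetical and is included only so that the query has something to exclude.

```python
# Each record holds the recipe's attribute values and, for every step,
# the value of its "necessary" attribute (defaulted to "yes").
RECIPES = [
    {"title": "Haupia (Coconut Pudding)", "type": "dessert", "preptime": 10,
     "steps": ["yes", "yes", "yes", "yes", "yes"]},
    {"title": "Two-Minute Fudge", "type": "dessert", "preptime": 15,
     "steps": ["yes", "yes", "yes", "no", "yes", "yes"]},
    # Hypothetical record, not taken from the examples in this chapter.
    {"title": "Slow Layer Cake", "type": "dessert", "preptime": 90,
     "steps": ["yes"] * 12},
]


def quick_desserts(recipes, max_minutes=30, max_required_steps=8):
    """Titles of desserts under the time limit and the required-step limit."""
    return [
        recipe["title"]
        for recipe in recipes
        if recipe["type"] == "dessert"
        and recipe["preptime"] < max_minutes
        and recipe["steps"].count("yes") < max_required_steps
    ]


matches = quick_desserts(RECIPES)
```

Only steps whose necessary value is yes count toward the limit, which is why the attribute has to be captured in the markup in the first place: the query cannot be answered from formatted output alone.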

Chapter 2, Introduction to DTD Development, explains how you can develop DTDs that help your project meet its goals for document creation, management, and utilization.



[1] In the case of DTDs that describe rules for the analysis of existing documents, what the text “means” might very well be connected to how it looks. For example, if you want to analyze a set of archived newspaper stories, it may be important to know that a story about a certain politician was “above the fold” on page 1 rather than buried deep in Section 2 somewhere. And in the case of DTDs such as the Hypertext Markup Language (HTML), much of the focus is on dictating appearance or browser behavior, while still taking advantage of some of the other benefits of SGML.

[2] From this point on, we generally stick to the standard SGML terminology for document content and strings of characters. Data refers to strings of characters, whereas text refers to any combination of characters and markup that makes up the document content. When we use the word “text” in referring colloquially to the prose-based material that makes up the bulk of the passages in a document, we put it in quotation marks.

[3] It is possible (though not usually advisable) to create and use SGML documents in a way that never requires a DTD to be used by any software. However, a DTD is still advantageous in these cases as a specification that helps humans understand and apply the markup correctly.