Chapter 2. Introduction to DTD Development

Table of Contents

2.1. DTD Development Phases
2.2. SGML Information Modeling Tools and Formalisms

The recipe DTD presented in Chapter 1, Introduction to SGML, didn't come out of nowhere, of course. Its rules were designed through a process of applying knowledge of recipe structures and goals for recipe-processing applications.

If your document processing purposes would be better served by a different or more complex organization for recipes, you could define a different DTD. For example, you might find it useful to add an “amount” element to be used inside the “ingredient” element, so that users can query your recipe database more easily to find dessert recipes with minimum amounts of sugar when they're running low (or on a diet). Or, you might want to create markup for oven temperature settings and food measurements, to facilitate switching between Celsius and Fahrenheit values and volume and weight measurements for different country-specific editions of a cookbook.

DTD development is, at base, an exercise in exactly this kind of value judgment. The list of design issues to be considered can be as long as you want to make it, given your own tradeoffs of time and effort. In the case of the recipe document type, for example, one could ask:

The answers to these questions depend strongly on the intelligence and logical structure you want to put in your documents, which in turn depend on what you intend to do with the intelligence and structure.

Some design considerations have less to do with the innate organization of the documents than with workflow, software design, and human factors. You might need to ask yourself the following questions:

This book naturally can't offer specific advice about the factors in the documents and environment in your individual company that might affect your DTD decisions, so we'll do the next best thing: Provide a framework that helps you do the work of articulating the factors and making the design decisions yourself. The methodological approach that underlies this book is to treat each aspect with the appropriate perspective at the appropriate stage, deferring decisions until the time is ripe for them.

If you're a DTD implementor who's familiar with reading and writing DTDs, you might be wondering why a whole framework for developing DTDs is needed when you could just type everything into a text editor directly as you think of it. Or, if you're a project manager with severe resource constraints, you might think the amount of work involved in “doing the job right” seems prohibitive.

An important goal for us is to discourage “design at the keyboard”: inconsistent decisions made for expediency by a programming specialist as a DTD is implemented, sometimes based on guesses about what the document specialists may want. No matter what the size of the project, it's best to use a cohesive philosophy for approaching the work, logical steps for executing it, tools for modeling, and formalisms for recording design decisions. In this way, you can maximize your ability to develop DTDs that are relevant to your goals, coherent, reliable, adaptable to changing needs, and reflective of the complexity of the information that you're modeling. You can also gain control over and reduce the overall costs of DTD development by increasing the quality of the result, while reducing the need for later reengineering of software applications and document instances and the retraining of personnel.

Specifically, the methodology can help you do the following when it's properly applied:

Do you need to bother analyzing your needs and developing DTD design requirements in a formal process if you know you'll be using an industry-standard DTD? In a word, yes. First of all, just because there are dozens of industry-standard DTDs doesn't mean that using one of them is appropriate for your business, especially if interchange with customers or business partners is not a goal for you. Second, industry-standard DTDs tend to be too big, too complex, and too general for everyday use; for every industry-standard DTD, there are dozens or hundreds of efforts by companies and consultants to create subsets of and extensions to the standard DTDs for use in real document production environments.

Our focus in this book is primarily on the development of new DTDs, not on the customizing of specific existing DTDs. For advice about implementing customized versions of existing DTDs, you should consult their accompanying documentation and, from there, determine the work that needs to be done. However, the steps for analyzing the suitability of an existing DTD have some similarities to those for doing original design work, and we'll cover these in some detail in Chapter 7, Design Under Special Constraints. Also, the suggestions in Chapter 10, Techniques for DTD Reuse and Customization, on implementing DTDs for modularity and customizability can help developers of variants of standard DTDs understand how to approach their work.

The following sections outline the steps, conceptual tools, and formalisms of the methodology.

2.1. DTD Development Phases

Developing a DTD has the following overall phases:

  1. Articulate the goals of your project (discussed in Chapter 3, DTD Project Management). Are you looking for better document validation, gains in author productivity, the ability to deliver documents on CD-ROM and through the Internet as well as on paper, enhanced online search capabilities, or all of the above? What documents are in the scope of the project? What is the proposed document processing architecture?

  2. Analyze the needs of your document data (discussed in Chapter 4, Document Type Needs Analysis). What kinds of document intelligence must be considered for encapsulation in markup?

    In this phase, the core of the methodology begins. It has the following steps:

    1. Identify and define the basic information components that the markup must encode.

    2. Classify the components into logical groups.

    3. Validate the analysis against other models that have already been developed.

  3. Design document type requirements based on your goals by modeling your document data with SGML (discussed in Chapter 5, Document Type Modeling and Specification). What are your requirements for markup, based on the knowledge and experience of subject matter experts, processing application developers, and document users?

    This phase continues the core of the methodology. It has the following steps:

    1. Select the components that the document type design should address.

    2. Build the element and attribute models for the overall document hierarchy.

    3. Build the element and attribute models for the mid-level elements.

    4. Build the element and attribute models for low-level elements.

    5. Populate the locations in the overall model where authors can choose from among many elements.

    6. Make connections within the model and from the model to the outside world.

    7. Validate that the model is complete and that it has been informed by similar models already developed.

  4. Complete the design of the actual DTD and implement it (discussed in Part III, “DTD Development”). What techniques should you use for easy maintenance? Should your DTD be modular, to allow for expansion? Should you create a set of interrelated DTDs for overlapping document types or divergent processing purposes?

    We use the term markup model to refer to the aspects of the DTD related solely to the design of the markup; the techniques for maintenance and customization could be considered the DTD's “architecture.”

  5. Test the outputs of the design processes and the DTD (discussed in Chapter 11, Validation and Testing). Have you met your goals?

  6. Document the DTD and train people to use it correctly (discussed in Part IV, “Documentation, Training, and Support”). How can authors and application developers best understand and use the DTD?

2.2. SGML Information Modeling Tools and Formalisms

In the course of describing the steps for DTD development, we introduce a number of conceptual tools for modeling document type requirements, as well as formalisms for recording those requirements. Sometimes, as in the case of a project glossary, the tool or formalism is simple prose. More often, sentences of description don't do the job as well as a nonprose version, such as a form, a graphical arrangement, or a matrix.

In addition, the graphical tools that we introduce have been designed to offer more precision in describing SGML requirements than prose does. Natural languages are notoriously ambiguous. For example:

A recipe contains a title and either a recipe cross-reference or a list of ingredients and a list of instruction steps.

Are both lists together an alternative to a recipe cross-reference, or just the first kind of list? While adding some commas in appropriate places would help people interpret the sentence correctly, the fact is that it's easier to construct a grammatically correct description that's ambiguous than one that is crystal clear. By using a tool that's been designed specifically to convey such descriptions, you can avoid many more cases of logical ambiguity.
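
To make the two readings of the sentence concrete, here is a minimal Python sketch that writes each reading as a regular expression over a sequence of child element names. The element names (`xref`, `ingredient-list`, `instruction-list`) and the helper `allowed` are illustrative assumptions, not declarations from the recipe DTD.

```python
import re

# Two possible readings of the prose sentence, written as regular
# expressions over a space-separated sequence of child element names.
# The element names are illustrative assumptions.

# Reading A: a title, then EITHER a cross-reference OR both lists together.
READING_A = re.compile(r"title (xref|ingredient-list instruction-list)")

# Reading B: a title, then EITHER a cross-reference OR an ingredient list,
# and after that always a list of instruction steps.
READING_B = re.compile(r"title (xref|ingredient-list) instruction-list")


def allowed(pattern, children):
    """Return True if the whole child sequence satisfies the pattern."""
    return pattern.fullmatch(" ".join(children)) is not None


xref_only = ["title", "xref"]
xref_then_steps = ["title", "xref", "instruction-list"]
both_lists = ["title", "ingredient-list", "instruction-list"]

for label, children in [
    ("xref only", xref_only),
    ("xref then steps", xref_then_steps),
    ("both lists", both_lists),
]:
    print(label, allowed(READING_A, children), allowed(READING_B, children))
```

The two readings agree on a recipe that has both lists but disagree on the recipes that contain a cross-reference, which is exactly the kind of difference a precise notation is meant to expose.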

If precision is required, why not just do information modeling directly in SGML? There are several reasons:

  • The tools provide a simple common language with which both subject matter experts and processing application developers can communicate. Getting the input of people who know the documents inside and out, but don't necessarily know the complexities or syntax of SGML, is essential to high-quality modeling results.

  • The formalisms separate the content of a requirement from its form. If a modeling requirement goes directly from people's minds into SGML markup declaration form, you have no way to check whether the ultimate expression in SGML was faithful to the original intent. This separation is also useful for providing the raw material for DTD documentation, which must be produced anyway.

  • The tools can offer conceptual modeling ideas that overlay the formal modeling abilities available in SGML. In this way, modeling techniques that have been refined over time to address many common characteristics of documents (such as the construction of element “collections” discussed in Chapter 5, Document Type Modeling and Specification) can be applied immediately by novices.

One tool that we introduce uses the metaphors of “trees” and “ancestry” for information modeling. In computer science terms, every SGML document can be thought of as an inverted tree, with nodes representing elements that branch out to the elements found within them. The top-level element is the “root,” the lowest-level elements containing only data characters are the “leaves,” and any part of a tree terminating in a leaf is a “branch.” A containing element is the “parent” of all the elements it directly contains and a more distant “ancestor” of all the elements contained within it at lower levels. The inner elements are its “children” (if directly contained) or more distant “descendants.”

To use the recipe example, recipe is the root element for all recipe documents, the parent element of title, ingredient-list, and instruction-list, and additionally an ancestor of ingredient and step. The ingredient-list element is the parent of some number of ingredient child elements, and instruction-list is the parent of some number of step child elements. The title, ingredient, and step elements are all leaf elements containing only character data.
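
The tree vocabulary can be tried out on a small instance. The following Python sketch is only an illustration: it assumes a fully tagged recipe instance that also happens to be well-formed XML, so that the standard library's `xml.etree.ElementTree` can parse it. The sample text content and the helper names `ancestors` and `leaves` are placeholders invented for this example.

```python
import xml.etree.ElementTree as ET

# A placeholder recipe instance; the text content is invented.
SAMPLE = """
<recipe>
  <title>Sample title</title>
  <ingredient-list>
    <ingredient>first ingredient</ingredient>
    <ingredient>second ingredient</ingredient>
  </ingredient-list>
  <instruction-list>
    <step>first step</step>
  </instruction-list>
</recipe>
"""

root = ET.fromstring(SAMPLE)

# ElementTree keeps no parent pointers, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}


def ancestors(element):
    """Return the names of all ancestors of an element, nearest first."""
    names = []
    while element in parent_of:
        element = parent_of[element]
        names.append(element.tag)
    return names


def leaves(element):
    """Return the names of all elements that contain no child elements."""
    return [e.tag for e in element.iter() if len(e) == 0]


first_step = root.find("instruction-list/step")

print(root.tag)                # the root
print([c.tag for c in root])   # the children of the root
print(ancestors(first_step))   # parent first, then more distant ancestors
print(leaves(root))            # the leaf elements, in document order
```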

Figure 2.1, “Some Potential Tree Structures for Recipe Documents” shows an informal representation of several document trees that our recipe DTD can potentially produce; the tree representing the Hawaiian pudding recipe is shown at the bottom of the figure.

Figure 2.1. Some Potential Tree Structures for Recipe Documents


Much of SGML information modeling involves generalizing a rule from examples of correct structure. There are a number of formalisms you can use to represent the generalization of individual document tree structures.

One formalism is an outline, where elements are simply indented according to their intended containment relationships. For example:

recipe
    title
    ingredient list
        ingredient
    instruction list
        step
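
As a rough illustration of how little information an outline carries, the following Python sketch stores the containment relationships as nested pairs and prints the same outline. The data layout and the function name `outline` are assumptions made for this example; nothing in the structure records order or occurrence rules.

```python
# Containment only: each node is a (name, children) pair.
CONTAINMENT = ("recipe", [
    ("title", []),
    ("ingredient list", [("ingredient", [])]),
    ("instruction list", [("step", [])]),
])


def outline(node, depth=0, indent="    "):
    """Return the outline lines for a node and everything it contains."""
    name, children = node
    lines = [indent * depth + name]
    for child in children:
        lines.extend(outline(child, depth + 1, indent))
    return lines


print("\n".join(outline(CONTAINMENT)))
```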

Another formalism is a diagram that shows the same information, only in graphical form. While the diagram in Figure 2.2, “DTD-Level Graphical Description of Recipe Containment Rules” looks very much like the simplest tree structure in Figure 2.1, “Some Potential Tree Structures for Recipe Documents”, it is meant to represent a generalization of the rules for all recipe documents.

Figure 2.2. DTD-Level Graphical Description of Recipe Containment Rules


The outline and containment-tree formalisms are useful for recording examples of structure in existing documents that are being analyzed; they show parent-child relationships between containers and can show multiple levels of nesting simultaneously. However, they don't capture the finer details of the content models, such as the fact that at least one ingredient element and at least one step element are required, and so they don't work very well as tools for developing precise requirements or as formalisms for recording decisions.

Another formalism often used for representing DTD rules is the “railroad diagram.” In the diagrams in Figure 2.3, “Railroad Diagrams for Recipe Content Models”, the boxes represent elements, and the boxes with black triangles indicate leaf elements (elements that contain only character data). The lines with arrows represent the requirements for the order of the elements in the document file; as you follow the lines and arrows from left to right, you come across the valid order of elements inside the element named at the left.

Figure 2.3. Railroad Diagrams for Recipe Content Models


This formalism does show that at least one ingredient and step are required because, in following the lines and arrows, you must pass through the ingredient element or step element at least once before being allowed to loop around and pass through it again. Thus, railroad diagrams are more precise than the outline and containment-tree formalisms. However, each railroad diagram can describe the content model for only one element, a limitation that can obscure some useful details about nesting levels and structural similarities among different elements; for example, you usually can't get a quick sense of how “deep” an element's content model is when you look at its railroad diagram. Also, railroad diagrams are difficult for nontechnical people to use as a modeling tool, and the combinations of lines and arrows can quickly grow large and unwieldy.
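
The loop in a railroad diagram behaves much like repetition in a regular expression. The Python sketch below is a loose analogy under that assumption: it keeps one pattern per element, just as there is one railroad diagram per element, and checks a sequence of child element names against it. The dictionary `CONTENT_MODELS` and the function `valid_children` are names invented for this example, and the patterns are reconstructed from the prose description of the recipe DTD.

```python
import re

# One pattern per element, mirroring one railroad diagram per element.
# "ingredient( ingredient)*" forces one pass through "ingredient"
# before the optional loop back, so an empty list is rejected.
CONTENT_MODELS = {
    "recipe": r"title ingredient-list instruction-list",
    "ingredient-list": r"ingredient( ingredient)*",
    "instruction-list": r"step( step)*",
}


def valid_children(parent, children):
    """Return True if the children appear in an order the model allows."""
    sequence = " ".join(children)
    return re.fullmatch(CONTENT_MODELS[parent], sequence) is not None


print(valid_children("ingredient-list", ["ingredient"]))
print(valid_children("ingredient-list", []))
print(valid_children("recipe", ["ingredient-list", "title", "instruction-list"]))
```

Because each pattern names only direct children, the sketch shares the limitation noted above: no single entry shows how deeply the elements nest.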

We use a graphical formalism called elm tree diagrams (“elm” stands for “enables lucid models”) for modeling information with SGML; we feel it combines the best features of the other formalisms. Figure 2.4, “Recipe DTD Tree Diagram” shows our elm tree diagram for the recipe DTD at a relatively late stage of the modeling work. (This example demonstrates only a few of the features of the tree diagram notation.)

Figure 2.4. Recipe DTD Tree Diagram


Boxes represent elements, and ovals represent locations where collections of the items in the oval are allowed in a freely ordered mixture. Attributes are represented by single lines of descriptive text next to element boxes. The rules for an element's content are shown below its box, with various kinds of lines and symbols indicating various sequence and occurrence rules. For example, this diagram happens to use the following parts of the notation:

  • A plus ( + ) on an element box means the element must occur one or more times.

  • The ovals containing “#PCDATA” mean that zero or more characters can be supplied in the elements to which the ovals are attached.

  • A horizontal bracket below an element box means that each box or oval attached to the bracket must appear in the left-to-right order shown.

Appendix B, Tree Diagram Reference provides a complete reference description of all the features of the tree diagram notation.
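
To suggest how the three notation features listed above might map onto SGML element declarations, here is a small Python sketch that renders a data model as declaration text. It is a reconstruction from the prose description of the recipe DTD, not the DTD from Chapter 1. The layout of `MODEL`, the function `declaration`, and the choice of `- -` for the tag omission parameters (which assumes that both tags are always required) are all assumptions of this example.

```python
# Element name -> content model.
# A list is an ordered sequence (the horizontal bracket).
# A trailing "+" means one or more occurrences (the plus on a box).
# The string "#PCDATA" stands for character data (the oval).
MODEL = {
    "recipe": ["title", "ingredient-list", "instruction-list"],
    "title": "#PCDATA",
    "ingredient-list": ["ingredient+"],
    "ingredient": "#PCDATA",
    "instruction-list": ["step+"],
    "step": "#PCDATA",
}


def declaration(name, content):
    """Render one element declaration as SGML text."""
    # In SGML content models, a comma separates members of a sequence.
    group = ", ".join(content) if isinstance(content, list) else content
    # "- -" is an assumption: neither the start-tag nor the end-tag is omissible.
    return f"<!ELEMENT {name} - - ({group})>"


for name, content in MODEL.items():
    print(declaration(name, content))
```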

This formalism offers an intuitive graphical form that allows people unfamiliar with SGML to express SGML-based modeling requirements with great precision. The tree diagrams grow along with the analysis and design process because they include special notations for representing decisions as yet unmade and pointers to expanded subdiagrams located elsewhere. A collection of complete diagrams also provides a powerful recording mechanism with which to document DTD design requirements for implementors, and is useful for documenting finished DTDs for authors. In working with groups of people who have varying levels of technical skill and SGML training, we have found that elm tree diagrams are a lingua franca that enhances DTD development productivity.

Several software tools are available for the graphical development and display of DTD rules, and they are generally regarded as useful adjuncts to developing DTDs, particularly for people who are new to application development. Don't feel, though, that you have to spend a lot of money on DTD development software to use our conceptual tools and formalisms; we and others have used them and developed DTDs quite successfully with pencils, paper, simple text editors, and validating parsers found in the public domain. Based on your SGML experience and budget, you may want to survey the state of the art in preparation for your own SGML effort.

Now that you have an idea of how the work will proceed, you can form a team for doing DTD development work and launch the project. Chapter 3, DTD Project Management describes how.