Chapter 8. Markup Model Design and Implementation

Table of Contents

8.1. Determining the Number of DTDs
8.1.1. Creating DTDs for Nested Document Types
8.1.2. Creating Variant Element and Attribute Declarations
8.2. Interpreting and Handling Element Content Model Specifications
8.2.1. Handling Specifications That Specify Ambiguous Content Models
8.2.2. Forcing the Occurrence of One of Several Optional Elements
8.2.3. Limiting the Occurrence of Any-Order Elements
8.2.4. Handling Specifications for Mixed Content
8.3. Handling Specifications for Attributes
8.3.1. Designing Enumerated-Type Attributes
8.3.2. Designing ID and ID Reference Attributes
8.3.3. Designing Attributes with Implied Values
8.4. Useful Markup to Consider
8.4.1. Semantic Extension Markup
8.4.2. Markup That Eases Document Conversion
8.5. Designing Markup Names
8.6. Designing Markup Minimization
8.7. Addressing Other Factors in Markup Design
8.7.1. Allowing Markup Characters as Document Content
8.7.2. Defining Entities for Special Symbols and Characters
8.7.3. Creating Text Databases and Templates
8.7.4. Supplying a Default Entity Declaration

While the document analysis report might seem very specific about element and attribute specifications, the correspondence between the specifications and the SGML implementation is not always obvious or complete; a set of tree diagrams or English descriptions is not tantamount to a finished markup model.

You need to transform the specifications into real element and attribute declarations—a task that requires interpretation skills, good two-way communication with the document type design team, knowledge of the syntax and constraints of SGML markup declarations, and an understanding of the environment in which the SGML documents will be created, managed, and processed.

The following sections cover some basic issues to address in interpreting the document analysis report and designing and implementing the specifics of the markup model.

We'll largely ignore the topic of using parameter entities and other SGML “programming” constructs to achieve customizability and maintainability goals. Rather, where we do mention them here, it will be in the context of using them to control the overall characteristics of the markup model created by the DTD. Chapter 9, Techniques for DTD Maintenance and Readability and Chapter 10, Techniques for DTD Reuse and Customization provide details on how to take advantage of these constructs.

8.1. Determining the Number of DTDs

The document analysis report already identifies each of the one or more document classes, in the general sense, required to model the documents in the project's scope. However, you must decide where to draw the lines between related document types in creating individual DTDs.

In general, if multiple document classes are closely related and have even a small chance of being used in combination in a single document, you should build one DTD for them. For example, user manuals, tutorials, and reference manuals might share many features and might be combined into single delivered documents.

However, the more distantly related the document classes are, the less appropriate it is to combine them into a single DTD. In these cases, it's better to separate out the similar portions and modularize the DTDs so that they can take advantage of a single core set of markup declarations. For example, customer letters and manuals bear almost no structural resemblance to each other, but might share the same information pool. (Section 10.2.1, “Making DTDs Modular” discusses how to modularize DTDs.)

You must also take into consideration the practical nature of the document creation, management, and processing environments, which may argue for creating variant DTDs, such as separate authoring, conversion, and presentation DTDs, based on the reference DTD for each document type. (Variant DTDs are discussed in Section 3.1.3, “The Reference DTD and Its Variants”.)

If you need to implement even a few variant features, you should seriously consider creating multiple DTDs instead of just one. It can be tempting to put all the features into a single DTD, especially if your resources for developing processing applications are limited or if you must compile the DTD for your SGML-aware software systems. However, along with the efficiencies you gain, you are also likely to incur some costs, especially if your documents are written by humans rather than generated entirely by autotagging software. For example:

  • If you choose to provide lax content models and attribute specifications to help the document creation process, you have no way to check whether the documents meet all your completeness criteria when they are finished.

  • If you provide markup for controlling literal layout and formatting in the DTD that is used for document creation, authors can undermine the presentation-independent quality of the markup by using these features.

  • If you include markup that authors are not responsible for using, you steepen their learning curve for the DTD, particularly if the editing environment cannot hide unnecessary markup choices from authors. Further, including unnecessary markup can contribute to any tendencies in the editing software to have performance and memory problems.

If you decide to create variant DTDs for specialized functions, avoid simply copying the reference DTD and making changes to the copied files because maintaining them will rapidly become impossible. You can use a number of techniques to create variant DTDs whose features automatically track those of the reference DTD. Typically, a variant feature has one of two effects:

  • Reducing the scope of the document type to a lower level, in effect creating “nested” document types

  • Modifying a content model or attribute list to loosen or tighten its specifications

The following sections describe how to choose techniques to achieve each of these effects. (Chapter 9, Techniques for DTD Maintenance and Readability and Chapter 10, Techniques for DTD Reuse and Customization describe in much greater detail how to apply SGML “programming” techniques.)

8.1.1. Creating DTDs for Nested Document Types

You can use three different techniques to create nested document types for specialized purposes:

  • Treating a single DTD as if it were multiple DTDs during parsing

  • Modularizing a DTD

  • Using the SGML SUBDOC feature

If you want to keep all your nested document types together in a single stored DTD, simply make sure that the DOCTYPE declaration in each document instance supplies the document type name that matches its document element.

For example, a single DTD for computer software documentation might encompass nested document types for manual chapters, whole manuals, volumes containing multiple manuals, and document sets containing multiple volumes. Figure 8.1, “Tree Diagrams for Nested Document Types” shows how the general structure might look.

Figure 8.1. Tree Diagrams for Nested Document Types

Tree Diagrams for Nested Document Types

A single DTD might appear to cover only the top-level document set, as follows.

<!ELEMENT set           - - (title, volume+)>
<!ELEMENT volume         - - (title, manual+)>
<!ELEMENT manual         - - (title, chapter+)>
<!ELEMENT chapter       - - (title, ...)>
<!ELEMENT title         - - (#PCDATA)>

However, you could use the following declaration in an instance consisting of a single chapter in order to parse and process it. The declarations for elements at higher levels than chapter are ignored. (Formal public identifiers are used here and throughout this chapter for references to DTDs in DOCTYPE declarations. For more information about formal public identifiers, see Section A.10, “Formal Public Identifiers and Catalogs”.)

<!DOCTYPE chapter PUBLIC "-//Ept Associates//DTD Set of Books//EN">
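
By the same reasoning, an instance that starts at the manual level could point to the same stored DTD and change only the document type name. The following sketch reuses the public identifier shown above; in this case the declarations for set and volume are the ones that go unused.

<!DOCTYPE manual PUBLIC "-//Ept Associates//DTD Set of Books//EN">
<manual>
<title>Biology and You</title>
⋮
</manual>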

If this solution doesn't meet your needs, you could instead use a modular DTD structure, which would allow you to target DTDs more precisely to the document instances. To modularize the DTD, put all the declarations related to chapters and their contents in one file, those related only to the manual level in another, and so on. In the files for the upper levels, incorporate the declarations in the lower levels by means of parameter entity references. The declaration for the title element would go in the lowest-level module, since it is needed at all the levels.

For example, the contents of chapter.dtd might be as follows.

<!ELEMENT chapter       - - (title, para*, sect*)>
<!ELEMENT title         - - (#PCDATA)>
<!ELEMENT para          - - (#PCDATA)>
<!ELEMENT sect          - - (title, para*, subsect*)>
⋮

The file for the manual DTD would contain the following.

<!ENTITY % chapter-module PUBLIC "-//Ept Associates//DTD Chapter//EN">
%chapter-module;
<!ELEMENT manual        - - (title, chapter+)>
⋮

The volume-level and set-level files would each similarly point to the next lower module. The document instances at each level would point to one of the four files in order to “activate” the appropriate subset of the markup model, and you could compile and develop processing applications for each nested document type separately.
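
As a sketch of the next level up, the file for the volume DTD might contain the following. The entity name manual-module is a hypothetical name chosen to parallel chapter-module, and the public identifier is the one used for the manual DTD elsewhere in this chapter; it is assumed to resolve to the manual file shown above.

<!ENTITY % manual-module PUBLIC "-//Ept Associates//DTD Manual//EN">
%manual-module;
<!ELEMENT volume        - - (title, manual+)>
⋮

Because the manual module itself references the chapter module, the volume DTD picks up the chapter-level declarations, including title, without declaring them again.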

Note that if you want to use the same file in multiple documents that start at different levels, you need to store the bulk of the file's content separately from its document type declaration. For example, to use a file containing a chapter at multiple levels, the chapter entity file must look like this:

<chapter>
<title>Fruitbats in Their Natural Habitat</title>
⋮
</chapter>

A complete chapter document might look like this:

<!DOCTYPE chapter PUBLIC "-//Ept Associates//DTD Chapter//EN"[
<!ENTITY chap SYSTEM "chapter.sgm">
]>
&chap;

A complete manual document might look like this:

<!DOCTYPE manual PUBLIC "-//Ept Associates//DTD Manual//EN"[
<!ENTITY chap SYSTEM "chapter.sgm">
]>
<manual>
<title>Biology and You</title>
⋮
&chap;
⋮
</manual>

Even if you modularize your DTD files for other reasons, you might not be able to manage an environment that has multiple processing applications targeted to each level. If you want to keep all the compiled DTD information together, you need to add a container element at the top level, as follows.

<!ELEMENT computerdoc   - - (set|volume|manual|chapter)>

Then all your document instances, no matter at which logical level they start, would have the following DOCTYPE declaration and top-level markup.

<!DOCTYPE computerdoc PUBLIC
        "-//Ept Associates//DTD Computer Documentation//EN">
<computerdoc>
⋮                  chapter, manual, volume, or set markup

</computerdoc>

Finally, you could use the SUBDOC feature to manage nested document types if your SGML-aware software supports this feature.

To use SUBDOC, first make sure the feature is set to YES in the SGML declaration for your document instances. Second, use a DTD like the following for the upper-level document types; here, the DTD for manuals is shown.

<!ELEMENT manual        - - (title)>
<!ATTLIST manual            chapfiles ENTITIES #REQUIRED>
<!ELEMENT title         - - (#PCDATA)>

The DTD for chapters can remain as shown earlier. With the SUBDOC feature in use, chapter instances must have their own document type declaration, as follows.

<!DOCTYPE chapter PUBLIC "-//Ept Associates//DTD Chapter//EN">
<chapter>
⋮
</chapter>

The manual document must declare subdocument entities for the chapter files and make references to them, as follows.

<!DOCTYPE manual PUBLIC "-//Ept Associates//DTD Manual//EN" [
<!ENTITY chap1 SYSTEM "chap1.sgm" SUBDOC>
<!ENTITY chap2 SYSTEM "chap2.sgm" SUBDOC>
]>
<manual chapfiles="chap1 chap2">
<title>...</title>
</manual>
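
For the first step above, the setting is made in the FEATURES section of the SGML declaration. The following is a sketch of that section only, assuming otherwise typical feature settings. The number after YES is the maximum number of subdocument entities that can be open at one time; the value 1 is an assumption that suits chapters that don't themselves contain subdocuments. FORMAL is set to YES here because the examples use formal public identifiers.

FEATURES
  MINIMIZE DATATAG NO   OMITTAG  YES  RANK     NO  SHORTTAG YES
  LINK     SIMPLE  NO   IMPLICIT NO   EXPLICIT NO
  OTHER    CONCUR  NO   SUBDOC   YES 1  FORMAL YES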

There are a number of drawbacks to using SUBDOC.

  • The processing of an attribute that references an SGML text entity is left completely unspecified by the SGML standard. For example, if you use SUBDOC in the manner shown above, there is no guarantee that the chapter entities will be parsed, validated, incorporated into the document data stream where they are referenced, or whatever other behavior you expect. This fact, and the fact that few product vendors support SUBDOC in any way at all, argue against using it.

  • Using SUBDOC requires that lower portions be stored in separate entities from upper portions.

  • Because SUBDOC relies on attributes with an ENTITY declared value rather than on entity references that appear in the normal flow of markup, you can't enforce that the lower portions conform to the same DTD used in the upper portions.

  • Subdocuments have a different ID name space from that of the containing document, so you cannot use the ID/IDREF mechanism to cross-refer between levels or between subdocuments.

8.1.2. Creating Variant Element and Attribute Declarations

If you want to create variants of your DTD that loosen, tighten, or change content models and attribute lists, parameter entities and parameterized marked sections can help you make the changes while letting the variant track the original DTD in all other respects.

For example, a reference DTD for a technical journal might require content inside submitted papers, because it would be inappropriate to have empty papers when the journal is published. On the other hand, the authoring DTD might allow papers to be entirely empty except for a title so that all the material other than the actual papers (introductions and so on) can be written without generating “content missing” errors when a journal is parsed. Following are the two different forms of the element declaration for paper.

reference form:
<!ELEMENT paper      - - (title, abstract, intro?, section+)>
authoring form:
<!ELEMENT paper      - - (title, abstract?, intro?, section*)>

You can use marked sections to conditionally include or ignore each of the two declarations, as follows, changing the parameter entity definitions that control the marked sections to suit DTD compilation or document validation. Using marked sections directly allows you to keep the two versions close to each other so that you can compare and correct them as necessary.

<!ENTITY % reference "INCLUDE">
<!ENTITY % editing   "IGNORE">
<![ %reference; [
<!ELEMENT paper      - - (title, abstract, intro?, section+)>
]]>
<![ %editing; [
<!ELEMENT paper      - - (title, abstract?, intro?, section*)>
]]>

Alternatively, you can use a small-scale parameter entity for the portion of the content model that must change, and use either marked sections or differing sets of entity declaration modules to “activate” the correct version of the content model for compilation or validation. In this case, storing a fraction of a content model far away from the declaration in which it's used can make it difficult for you to read and maintain the DTD.

<!ENTITY % paper-content "abstract, intro?, section+">
<!ENTITY % paper-content "abstract?, intro?, section*">
<!ELEMENT paper      - - (title, %paper-content;)>
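
As written, the two paper-content declarations simply sit side by side; an SGML parser uses the first declaration it encounters for an entity and ignores later ones. A minimal sketch of activating one or the other with marked sections, reusing the reference and editing entities from the earlier example, might look like this:

<!ENTITY % reference "INCLUDE">
<!ENTITY % editing   "IGNORE">
<![ %reference; [
<!ENTITY % paper-content "abstract, intro?, section+">
]]>
<![ %editing; [
<!ENTITY % paper-content "abstract?, intro?, section*">
]]>
<!ELEMENT paper      - - (title, %paper-content;)>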

Chapter 10, Techniques for DTD Reuse and Customization describes how to use marked sections and parameter entities in much greater detail.

8.2. Interpreting and Handling Element Content Model Specifications

SGML offers a great deal of power in constructing content models. Because of the nature of teamwork and the distribution of SGML expertise in the project, you can't necessarily rely on design teams to come up with subtle programmatic solutions to difficult modeling problems. As a result, if you take the document analysis report at face value in designing content models, your resulting element declarations might incorrectly reflect the design team's intent or, worse, be syntactically invalid.

The following checklist summarizes areas where you can uncover potential element-related problems in the document analysis report. Some of these areas are discussed in greater detail in the following sections.

  • Suspiciously Similar Elements

    For example, for divisions at three levels, div1, div2, and div3, the team may have specified the creation of three different division title elements, divtitle1, divtitle2, and divtitle3.

    The team may have intended to collapse these elements into one element, or may not have considered using the context to distinguish the different roles for a single element. It's best to check with the team members.

  • Single Elements Acting Like Multiple Elements

    For example, within a single list element, the team may have specified content of either normal list items or description blocks for equipment error messages.

    This is often a case of trying to do the work of multiple elements with one. The element should probably be broken out, in this case, to a list element and a message-list element.

  • Recursive Elements

    This often happens with data-level elements, which are often specified to contain a collection of #PCDATA plus “all the data-level elements” without the realization that this includes the parent element itself. For example, the team may have inadvertently allowed cross-reference citations to contain nested citations. The most practical way to deal with undesirable recursion of this kind is often to use a content model exclusion.

    For elements that contain themselves deliberately, first make sure the element isn't required in the content model. If it is, naturally it will be impossible for any instance of this element to satisfy the content model.

    In any case, you need to look into whether very deep nesting (for example, generic document divisions like div that can be nested to dozens of levels within themselves) is a legitimate way to structure the information. You may want to keep the recursive content model, but use some means other than the DTD to catch too-deep nesting (such as having human copy editors check the documents, or writing additional validation software). An alternative to recursive content models is to create a different element for each allowed nesting level, making the level part of the element name (for example, div1 for a top-level section, div2 for the next level of section, and so on).

  • Elements That Act Like Entity References

    In general, it's inappropriate to have an empty element such as include-chapter with an ENTITY attribute for pulling SGML-encoded chapters into a document, since the parser won't validate the external SGML material as part of the document. Regular entity references should be used for this purpose.

  • Inappropriately Complex Content Models

    Make sure that highly complex content models will have a return on the investment. If the burden of applying correct markup is likely to overwhelm authors and the task cannot be made easier through templates, forms, or autotagging software, some data will probably escape being marked up or will be marked up incorrectly. These areas in the document analysis report should at least be watched closely in the DTD testing period. An authoring DTD may be necessary.

  • Ambiguous Content Models

    Just as it's possible to write a content model in SGML that you later discover through parsing has an ambiguity problem, it's possible for design teams to hand you specifications that have an ambiguous nature. In these cases, you must work with the design team to decide how to modify the specifications to make an acceptable content model. Section 8.2.1, “Handling Specifications That Specify Ambiguous Content Models” describes how to solve ambiguity problems.

  • Elements That Could Mistakenly Be Empty

    Sometimes, working on individual components, design teams can overlook the broader consequences of their choices. This situation occasionally happens with specifications for content models containing ordered sequences or any-order groupings of elements, all of which are optional. If leaving them all out is unacceptable, you need to come up with a content model that's quite a bit more complicated than the one suggested by the specifications. Section 8.2.2, “Forcing the Occurrence of One of Several Optional Elements” describes how to construct the alternate content models.

  • Collections Containing Elements That Should Have Limited Occurrence

    Often a specification will come up for a model that contains a collection of some elements, among which there are one or more elements that can't appear multiple times; these must appear once or not at all. This specification is much easier to express in English than in tree diagram form (or in SGML form, for that matter). The team might even supply an inaccurate diagram and supplement it with further English description in a case like this. Section 8.2.3, “Limiting the Occurrence of Any-Order Elements” describes how to construct the necessary content models.

  • Problematic Mixed Content

    Often, the most difficult modeling choices are those for elements that contain #PCDATA, because such elements often need to contain other elements as well. The ISO 8879 standard makes specific recommendations about mixed content models in order to avoid confusion over SGML's handling of white space in document text. Section 8.2.4, “Handling Specifications for Mixed Content” discusses ways to handle mixed content.
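
To make the recursion item in the checklist concrete, here is a minimal sketch of a content model exclusion. The element names citation, emphasis, and term and the entity name data-elems are hypothetical; the point is only that the shared collection can keep listing citation while the citation element itself rules out nested citations.

<!ENTITY % data-elems "emphasis|term|citation">
<!ELEMENT para       - - (#PCDATA|%data-elems;)*>
<!ELEMENT citation   - - (#PCDATA|%data-elems;)* -(citation)>
⋮

An exclusion applies at every depth inside the element that declares it, so a citation inside an emphasis inside a citation is ruled out as well.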

8.2.1. Handling Specifications That Specify Ambiguous Content Models

If the document analysis report contains a specification that implies an ambiguous content model, you must choose an appropriate way to change the model, because parsers will report an error in cases of ambiguity.

Certain common kinds of specifications suggest ambiguous content models. For example, the specification shown in Figure 8.2, “Specification for Part Numbers Resulting in Content Model Ambiguity” specifies that the part-info element can contain part-number twice, the first time optionally.

Figure 8.2. Specification for Part Numbers Resulting in Content Model Ambiguity

Specification for Part Numbers Resulting in Content Model Ambiguity

The part number example may seem far-fetched, but it's adapted from a real-life specification. The reasoning of the team was that “An internal inventory number is always supplied, but we know that if an external number is supplied, it always goes first.” This specification shows that the design team was trying to do two specialized jobs with different occurrences of the same general-purpose element.

Without further analysis, this specification would suggest the following content model.

<!--                      one           two         -->
<!ELEMENT part-info  - - (part-number?, part-number)>

However, this declaration is ambiguous because when a parser comes across the first part number in a document, it won't know without looking past it whether it's supposed to represent the “part number one” or the “part number two” of the element declaration.

Two kinds of solutions are possible: keeping the model general or making new, more specific elements. Both require feedback from your design team before you can proceed.

stay general:
<!ELEMENT part-info  - - (part-number, part-number?)>
or get specific:
<!ELEMENT part-info  - - (ext-part-number?, int-part-number)>

If you and the team conclude that you honestly won't need to query on, format, or otherwise process internal part numbers separately from external ones, you might as well use the first solution, which solves the ambiguity problem. However, it's more likely that you've flushed out a need for more precision in your markup. In this case, you should replace the original general-purpose element with two specialized ones.
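
With the more specific declaration, instances carry the distinction directly in the markup. In the following sketch, the part number values are invented for illustration, and both elements are assumed to contain only #PCDATA.

assumed declaration:
<!ELEMENT (ext-part-number|int-part-number) - - (#PCDATA)>
instance with both numbers:
<part-info>
<ext-part-number>EX-2041</ext-part-number>
<int-part-number>77310</int-part-number>
</part-info>
instance with only the internal number:
<part-info>
<int-part-number>77311</int-part-number>
</part-info>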

Figure 8.3, “Specification for Jokes Resulting in Content Model Ambiguity” shows another common configuration where ambiguity is a problem. Again, it involves trying to do too much work with a single element. The following shows a specification for two document divisions, both of which allow a special-purpose joke element in their introductory mixtures of information unit elements. However, the container for a joke collection also expects to contain joke as its primary content.

Figure 8.3. Specification for Jokes Resulting in Content Model Ambiguity

Specification for Jokes Resulting in Content Model Ambiguity

This kind of specification suggests the following pattern of declarations.

<!ELEMENT division   - - (title, (para|joke|list)+)>
<!ELEMENT joke-set   - - (title, (para|joke|list)*, joke+)>

However, the second declaration is ambiguous because, on occurrence of a joke element in your document, the parser won't know whether it's meant to be an introductory joke or the start of the main joke content of your specialized joke-set container. Three kinds of solutions are possible (assuming you can't entirely do away with introductory material):

  • Add a wrapper element in one of two places.

    
    add a wrapper element in one place:
    <!ELEMENT joke-set   - - (title, joke-intro?, joke+)>
    <!ELEMENT joke-intro - - ((para|list|joke)*)>
    
    
    or the other:
    <!ELEMENT joke-set   - - (title, (para|list|joke)*, joke-group)>
    <!ELEMENT joke-group - - (joke+)>
    

    This solution adds a bit of markup complexity. However, the extra container can be useful in SGML-aware processing.

  • Rename jokes in one context or the other.

    
    change the joke element name in one place:
    <!ELEMENT joke-set   - - (title, (para|list|intro-joke)*, joke+)>
    <!ELEMENT (intro-joke|joke) - - (#PCDATA)>
    
    
    or the other:
    <!ELEMENT joke-set   - - (title, (para|list|joke)*, main-joke+)>
    <!ELEMENT (main-joke|joke) - - (#PCDATA)>
    

    This solution is problematic because it disallows jokes from being easily reused (for example, stored as an entity and referenced) in all the contexts where jokes are allowed, and it increases the difficulty of retrieving all the jokes in a document database.

  • Restrict the collection to exclude jokes.

    
    leave out the general-purpose joke:
    <!ELEMENT joke-set   - - (title, (para|list)*, joke+)>
    

    This solution contradicts the letter of the design team's specification, but may be an option they want to consider. (This solution can get tricky if you use parameter entities to manage your collections, which is usually the case; Section 9.3, “Managing Parameter Entities for Element Collections” describes how you can handle this.)

The decision requires feedback from your team before you can proceed.
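
If you choose the third solution and your collections are managed through parameter entities, one possible arrangement is to define the collection without joke first and build the full collection from it. The entity names blocks-nojoke and blocks are hypothetical, and this is only an illustration of the idea rather than the approach Section 9.3 lays out.

<!ENTITY % blocks-nojoke "para|list">
<!ENTITY % blocks        "%blocks-nojoke;|joke">
<!ELEMENT division   - - (title, (%blocks;)+)>
<!ELEMENT joke-set   - - (title, (%blocks-nojoke;)*, joke+)>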

8.2.2. Forcing the Occurrence of One of Several Optional Elements

In some circumstances, allowing a container element to be entirely empty may be a strategic decision; for example, you may want to have a reference DTD that disallows empty elements of a certain type, and an authoring DTD that allows them. However, in cases where an empty container element would be an absurdity, you may need to construct a somewhat complex content model to prevent it from being empty.

For example, the specification shown in Figure 8.4, “Specification for Back Matter That Can Be Empty” is for an optional back matter element containing any number of appendixes, a glossary, and an index, all of which are optional.

Figure 8.4. Specification for Back Matter That Can Be Empty

Specification for Back Matter That Can Be Empty

Without further analysis, the specification would suggest something like the following declaration.

<!ELEMENT back-matter  - - (appendix*, glossary?, index?)>

The back matter element is optional, but even if it is present, it may not have any content. Assuming the design team agrees that an empty back matter element is wrong, you need to ensure that the DTD requires enough information to be supplied.

The following solution might come to mind.

<!ELEMENT back-matter  - - (appendix|glossary|index)+>

However, while this content model ensures that the element won't be entirely empty of content, it removes the requirement for element order and allows multiple glossaries and indexes, while a maximum of one each was desired.

Another solution might suggest itself.

<!ELEMENT back-matter  - - ((appendix+, glossary?, index?)
                             |(appendix*, glossary, index?)
                             |(appendix*, glossary?, index))>

Unfortunately, this content model is ambiguous because parsers can't know, on occurrence of an appendix element, which line of the declaration, as shown here, is the correct one to follow. (Section 8.2.1, “Handling Specifications That Specify Ambiguous Content Models” discusses ambiguity in more detail.) The same problem would occur with glossary, if parsers were able to get that far.

The solution is close at hand, however, and is similar to this content model. Use a “waterfall” approach: Assume for each optional element, in turn, that its occurrence is required and the previous ones are not present at all, and build a model group for it inside a larger OR group.

<!ELEMENT back-matter  - - ((appendix+, glossary?, index?)
                             |(glossary, index?)
                             |(index))>

Figure 8.5, “Final Tree Diagram for Back Matter That Always Has Content” shows the final tree diagram for this content model.

Figure 8.5. Final Tree Diagram for Back Matter That Always Has Content

Final Tree Diagram for Back Matter That Always Has Content
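
The same pattern extends to longer sequences. As a sketch, suppose a hypothetical element e has the specification (w?, x?, y?, z?) and must not be empty. Each branch of the OR group starts with a different required element, so the parser can always tell from the first element it sees which branch applies.

<!ELEMENT e - - ((w, x?, y?, z?)
                 |(x, y?, z?)
                 |(y, z?)
                 |(z))>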

You can apply similar logic to any-order groups of elements. For example, if the original specification suggests the following content model, a could be entirely empty.

<!ELEMENT a - - (b? & c? & d?)>

If having an empty a element is wrong, you would need to apply “waterfall” thinking to uncover the following solution. In this case, the inner groups must be made sequential instead of any-order, so that your assumption about which element appeared first can be tested for each of the three cases.

<!ELEMENT a - - ((b, (c? & d?))
                 |(c, (b? & d?))
                 |(d, (b? & c?)))>

Note

#PCDATA in an element's content model can be satisfied with zero data characters—the null string. Thus, for elements that allow #PCDATA anywhere in the content, you cannot use a validating parser to ensure that the element has content. If you want greater control of your SGML documents in this respect, you must build other applications to do the checking.
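
For example, given the title declaration used throughout this chapter, a validating parser accepts both of the following titles, even though the second contains no characters at all.

<!ELEMENT title         - - (#PCDATA)>

<title>Fruitbats in Their Natural Habitat</title>
<title></title>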

8.2.3. Limiting the Occurrence of Any-Order Elements

If you are faced with a specification to limit the occurrence of an element that appears among collections of other elements, don't give up: It is possible to construct a content model that doesn't result in a free-for-all.

For example, the somewhat fanciful specification shown in Figure 8.6, “Specification for Song Collection Including Ballads” is for song lists for rock 'n' roll music CDs. Any one CD can have any collection of bombastic rock songs and danceable rock songs, but a maximum of one ballad. The tree diagram inaccurately represents the specification.

Figure 8.6. Specification for Song Collection Including Ballads

Specification for Song Collection Including Ballads

If you allow all the song elements in a collection, as follows, you can't control the occurrence of ballads.

<!ELEMENT songlist  - - (bombastic|danceable|ballad)*>

Another attempt at the declaration might look like the following.

<!ELEMENT songlist  - - ((bombastic|danceable)*, ballad?,
                        (bombastic|danceable)*)>

However, this content model is ambiguous because, before a ballad occurs in the song list, it's not clear whether a bombastic or danceable rock song should satisfy the first or the second collection block (ambiguity is discussed in Section 8.2.1, “Handling Specifications That Specify Ambiguous Content Models”).

For the desired effect, move the optionality from the ballad to a group containing the ballad and the final collection.

<!ELEMENT songlist - - ((bombastic|danceable)*,
                        (ballad, (bombastic|danceable)*)?)>

An accurate tree diagram for this solution would look like Figure 8.7, “Final Tree Diagram for Song Collection Including Ballads”.[10]

Figure 8.7. Final Tree Diagram for Song Collection Including Ballads

A more complicated situation would arise if you had more than one element that needed to be restricted. For example, you might need to model dictionary entries, some of which contain any-order blocks of information about word origin (origin), usage notes (usage), and famous quotations containing the word (quote), along with other general-purpose descriptive elements (block). In any one entry, the specification is for a maximum of one of each of the specialized elements to occur, but for multiple general-purpose elements to be allowed (because each is about a different subject that can't yet be identified).[11] The tree diagram might look like Figure 8.8, “Specification for Dictionary Entry Collection”, which doesn't quite correspond to what is wanted.

Figure 8.8. Specification for Dictionary Entry Collection

A first attempt at the corresponding element declaration might look like the following.

<!ELEMENT dictentry - - (..., (origin|usage|quote|block)*)>

However, this content model allows multiple occurrences of the specialized elements. Instead, you might try one of the following models.

<!ELEMENT dictentry - - (..., (origin? & usage? & quote?),
                         block*)>
<!ELEMENT dictentry - - (..., (origin? & usage? & quote? &
                         block*))>

However, the first declaration won't allow you to capture general-purpose information before or between the special elements, and the second forces all general-purpose information to be either before or after the other elements, or in one single location between any two of them.

Another option might be to allow block as an inclusion to the content model of dictentry:

<!ELEMENT dictentry - - (..., (origin? & usage? & quote?)) +(block)>

However, this model allows block to appear anywhere in the main part of the entry (represented by the ellipsis), as well as anywhere between or inside the special-purpose elements.

If you want the DTD to be highly prescriptive about the desired configuration (that is, if you want a validating parser and not some other mechanism to be in charge of enforcement), you need to resort to the following content model.

<!ELEMENT dictentry - - (..., block*,
                         ((origin, block*)?
                          &(usage, block*)?
                          &(quote, block*)?))>

This model allows any number of general-purpose elements before, between, and after the special elements, allows the special elements in any order, keeps each occurrence of a special element to a maximum of one, and ensures that the content model is unambiguous. This kind of construction gets very unwieldy as more specialized elements are added, and large AND groups make high demands on parsers and DTD compilers because they are shorthand for a choice among many SEQ groups. However, if your project has a strong requirement to do this validation work with the parser, it can be done.
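
The growth behind that warning is easy to see by writing out the SEQ groups that an AND group stands for. The following Python sketch only does the counting; whether a given parser actually expands AND groups this way is implementation-dependent, and the function name expand_and_group is hypothetical.

```python
from itertools import permutations

def expand_and_group(members):
    """List the SEQ groups that an AND group of `members` is shorthand for."""
    return ["(" + ", ".join(order) + ")" for order in permutations(members)]

members = ["(origin, block*)?", "(usage, block*)?", "(quote, block*)?"]
for seq in expand_and_group(members):
    print(seq)
```

Three members yield six orderings; five members would already yield 120, which is why each added specialized element makes the construction noticeably heavier.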

8.2.4. Handling Specifications for Mixed Content

The “leaf” elements of a DTD usually contain #PCDATA. However, many of them must also contain other data-level elements, a situation that must be handled delicately in content models because of potential problems related to lines, or records, of stored document data.

For example, a specification for lists might specify that the list items should contain character data, followed optionally by some paragraphs that further explain the first short phrase. Figure 8.9, “Specification for List Items with Problematic Mixed Content” shows the specification.

Figure 8.9. Specification for List Items with Problematic Mixed Content

This specification suggests something like the following declarations.

<!ELEMENT listitem  - - (#PCDATA, para*)>
<!ELEMENT para      - - (#PCDATA)>

In the following example, record ends (REs) are indicated by the symbol RE. Validating parsers will report an error at line 6.

1:  <listitem> RE
2:  Oranges RE
3:  <para> RE
4:  Oranges are a lovely RE
5:  orange-colored fruit. RE
6:  </para> RE               ERROR
7:  <para> RE
8:  They are grown in warm climates. RE
9:  </para> RE
10: </listitem> RE

The error is reported because once the first para element in a list item is opened, the content model of listitem prevents it from directly containing any more character data, and the RE at the end of line 6 counts as data.[12]

It's best to follow the recommendation of the ISO 8879 standard to use #PCDATA either alone or in repeatable OR groups, so that you can avoid the possibility of this kind of error. In fact, trying to limit the occurrence of character data is usually a sign that more analysis is needed. You can see the likely need for more analysis from the appearance of the tree diagram in Figure 8.9, “Specification for List Items with Problematic Mixed Content”, where the position of the first block of #PCDATA in the diagram looks like an obvious candidate for an element.

There are two ways to reorganize the model: a collection solution or an isolation solution.

First, you can turn the content model into a collection containing #PCDATA and the other element.

<!ELEMENT listitem  - - (#PCDATA|para)*>
<!ELEMENT para      - - (#PCDATA)>

This model eliminates the risk of parser errors for mixed content problems. However, since list items tend to have their own internal structure, this solution should be avoided if you have another acceptable alternative, because it would allow the out-of-place nonparagraph text on line 7.

1:  <listitem>
2:  Oranges
3:  <para>
4:  Oranges are a lovely
5:  orange-colored fruit.
6:  </para>
7:  They are grown in warm climates.
8:  </listitem>

The collection solution is more appropriate inside elements for which the bulk of the content is likely to be character data, such as full sentences or paragraphs of text.

The alternative solution, and in the case of list items the better one, would be to isolate the character-data portion of the list item's content model in a subelement of its own, either in another para element or in a new, specialized element.

Interpret the first paragraph as the list intro:
<!ELEMENT listitem  - - (para+)>
<!ELEMENT para      - - (#PCDATA)>

Or create a new specialized element:
<!ELEMENT listitem  - - (listintro, para*)>
<!ELEMENT listintro - - (#PCDATA)>
<!ELEMENT para      - - (#PCDATA)>

8.3. Handling Specifications for Attributes

Often, document analysis reports are vague or incomplete regarding attributes and their acceptable values. Because attributes are often used to control specific functions in the processing environment, design teams often defer some of these matters to later stages when processing software will be developed. You may need to do some research to fill in the blanks.

The following checklist summarizes areas where you can uncover attribute-related problems or insufficient specifications in the document analysis report. Some of these areas are discussed in greater detail in the following sections.

  • Subclass Attributes

    In certain circumstances, it may be appropriate to further describe an instance of an element by providing an attribute value (for example, indicating whether a trademarked term is registered or unregistered). However, you may want to consider breaking up the element into separate elements in the following cases:

    • The attribute value is required

      For example, if the content in a productname element must be marked up as being a product either for resellers or for the consumer market, it suggests that the two kinds of product may be sufficiently different to warrant two elements, even if they have the same content model and are allowed in the same contexts.

      An element with a required attribute that has one of ten possible values will need all the documentation and training resources that ten separate elements would have had. Also, depending on the authoring tool chosen, the single-element solution might even be more difficult to apply to document content. Reducing the number of elements only to replace them with attributes can be false economy.

    • Each attribute value is associated with different sets of other attributes

      For example, if you have a single list element with an attribute to indicate whether it is bulleted or numbered and there are other attributes to refine the choice of either bullet or number, it's more sensible to have two list elements. Attributes can't control whether authors provide other attributes, so authors could specify an incorrect combination of attribute values if you put all the attributes on the same element.

    • Each attribute value is associated with different choices of content from the same content model

      For example, if you have a single list element with an attribute to indicate whether it is for regular text or for computer message descriptions and the model allows for either one or more list items or one or more message description blocks, you have a clear case of two elements masquerading as one. Attributes can't affect content models, so authors could specify an incorrect combination of attributes and content if you don't separate out the models.

  • Attributes Containing Free Text

    Sometimes the flexibility to provide an attribute value of arbitrary length and contents is needed; in this case, the CDATA declared value is appropriate. However, the parser has no control over CDATA text, thus inviting consistency problems in specifying attribute values and leaving the job of handling the attribute value entirely in the hands of processing applications. You should consider whether NAME is a better choice.

    If the attribute value is intended to be output as document content (for example, a title attribute), the attribute should almost certainly be made into an element instead. If the markup model has many cases of CDATA attributes in it, they should at least all be used consistently (treated as keywords, as invisible author comments, as document content, or whatever).

  • Presentational Attributes

    Formatting information can adversely affect the data's portability and longevity, whether it's in an element or an attribute. You should find ways to abstract away from presentation-related information where possible (though it's better to capture this information in attributes, if it must be supplied).

  • “Current” Default Values

    Beware of default values that are supposed to pick up the most recently assigned value for an instance of that element. Such attributes would need a default value of #CURRENT. These attributes are problematic because they effectively tie down the element to one static location in the document, making the element entirely dependent on its linear context. For any dynamic document-building scheme or for document types that do not depend on the linear order in which elements are provided, such attributes are inappropriate.

  • Correspondence of Common Attributes

    If possible, attributes with the same name on different elements should have the same purpose and attributes with different purposes should have distinct names. Otherwise, you'll have a problem documenting the proper use of the attributes.

  • Attributes with Enumerated Values

    Section 8.3.1, “Designing Enumerated-Type Attributes” discusses strategies for handling attributes for which you want to create a list of values.

  • ID and ID Reference Attributes

    Attributes with a declared value of IDREF place an extra documentation burden on the DTD implementor. Section 8.3.2, “Designing ID and ID Reference Attributes” discusses strategies for handling attributes that use this SGML mechanism for symbolic identification.

  • Attributes with Implied Values

    Attributes with a default value of #IMPLIED place an extra documentation burden on the DTD implementor. Section 8.3.3, “Designing Attributes with Implied Values” discusses the considerations in designing attributes with implied values.

See Section 10.2.4, “Making Markup Names Customizable” for information on using attributes in a special way for variant DTDs.

8.3.1. Designing Enumerated-Type Attributes

Sometimes it's appropriate to provide a finite list of choices as the declared value of an attribute. For example, the following declaration creates a status attribute with two possible values: draft and final.

<!ATTLIST doc
        status  (draft|final)   #IMPLIED
>

In these cases, you may find that it's a good idea to leave the way open for values you haven't yet thought of. For example, for the doc element, you might want to do the following to allow for different levels of draft status that might be designated in the future. Unfortunately, it isn't valid SGML.

<!ATTLIST doc
        status  (draft|final|NAME)      #IMPLIED
>

What this declaration tries to accomplish is a kind of semantic extension (a concept discussed more in Section 8.4.1, “Semantic Extension Markup”), since the DTD is, in some cases, leaving it up to the document instance to supply some of the semantics of the content. Following are the ways you can accomplish this goal for attributes.

  • Use a NAME or NMTOKEN declared value instead of enumerating the valid literal token values.

    <!ATTLIST doc
            status  NAME            #IMPLIED
    >
    

    This solution leaves you no way to control, through the parser, what attribute values are supplied. Authors and conversion programs must adhere to conventions for supplying values (or must use other software to check for adherence) if consistency is desirable.

  • Enumerate the token values, add an other value, and add an additional attribute to hold the new values.

    <!ATTLIST doc
            status          (draft|final|other)     #IMPLIED
            otheratt        NAME                    #IMPLIED
    >
    

    Only when other is supplied in the status attribute should the processing application look at the value of otheratt, if it has one. This is an excellent system for controlling values while allowing an escape hatch; its main drawback is that you can't force a value for otheratt to be supplied when the status value is other. (Section 8.4.1, “Semantic Extension Markup” describes additional ideas for semantic extension along these lines.)

    Note that if you don't have an other keyword and simply use the impliability of the status attribute to indicate that otheratt should be examined, you prevent the possibility of implying a value for status besides other; for example, you won't be able to assume the value is final based on whether the document was set to be read-only in the database.

  • Use parameter entities to allow the list of enumerated token values to be extended in an internal declaration subset of the DTD.

    newvals can be extended by being redefined as "|newvalue"
    
    <!ENTITY % newvals "">
    <!ENTITY % statusvals "draft|final">
    <!ATTLIST doc
            status  (%statusvals; %newvals;) #IMPLIED
    >
    

    This solution works almost as well as the previous one, but has some of the dangers of allowing markup model extension. (The technique and the risks are discussed in detail in Chapter 10, Techniques for DTD Reuse and Customization.)
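
For the second approach (an other keyword plus a companion attribute), the logic expected of a processing application might look like the following Python sketch. It is a minimal illustration, assuming attribute values arrive as a dictionary of lowercase strings and that an omitted attribute is simply absent; the function name effective_status and the inferred parameter are hypothetical.

```python
def effective_status(attrs, inferred=None):
    """Resolve the status a processing application should act on.

    `attrs` maps attribute names to the values specified on a doc element.
    `inferred` is whatever the application implies when status is omitted.
    """
    status = attrs.get("status")
    if status is None:
        # status was left implied, so otheratt is deliberately ignored
        return inferred
    if status == "other":
        # only now is otheratt consulted; the DTD cannot force it to exist
        return attrs.get("otheratt", "other")
    return status

print(effective_status({"status": "other", "otheratt": "review"}))
print(effective_status({}, inferred="final"))
```

Keeping the implied case separate from the other case is what preserves the application's freedom to infer a value such as final from the environment.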

A typical enumerated list of attribute values serves as a Boolean (two-state) toggle: yes/no, on/off, or true/false. Boolean attributes are the kind most often affected by SGML's requirement that the set of declared token values for all attributes on a single element type be unique.

This requirement exists because of the SHORTTAG minimization feature (controlled by the SGML declaration), which, among other things, allows authors to omit the names of enumerated-type attributes and specify only their values in element start-tags, as follows.

<doc final>content...</doc>

So what's the problem? Say you want to put these two attributes on your document element:

  • An attribute to record whether the document is orderable through the online sales catalog

  • An attribute to record whether comments on the document are solicited from readers

Parsers will report an error for an attribute list declaration like the following.

<!ATTLIST doc
        orderable       (yes|no)         yes
        comments        (yes|no)         yes
>

This declaration sets up the possibility that an author could supply the following ambiguous markup.

<doc no>content...</doc>

Two approaches are common in avoiding this problem. One is to pack more information into each token, as follows.

<!ATTLIST doc
        orderable       (orderable|notorderable)  orderable
        comments        (sendcomments|nocomments) sendcomments
>

The other is to use NUMBER declared values, where "0" is understood to be no, off, or false, and any other number is understood to be yes, on, or true (the number-value approach can work for more than two distinct values, of course).

<!ATTLIST doc
        orderable       NUMBER          1
        comments        NUMBER          1
>

Several industry-standard DTDs use the NUMBER declared value in this fashion to good effect. However, in situations where authors must directly choose values (as opposed to, for example, table-editing environments that manage attribute values away from the authors' view), you might want to use meaningful keywords instead. Keywords will be easier to document and may avoid confusion that results from mistaking the logical sense of the attribute (“Is 1 supposed to mean it's orderable, or it's not?”).
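
Whichever approach you choose, the uniqueness requirement described earlier in this section can be checked mechanically before a parser ever sees the DTD. The following Python sketch is an illustration only; the function name find_token_collisions and the dictionary representation of an attribute list are assumptions made here, not part of any SGML tool.

```python
from collections import defaultdict

def find_token_collisions(attlist):
    """Map each token declared by more than one attribute to those attributes.

    `attlist` maps attribute names to their enumerated token lists
    for a single element type.
    """
    owners = defaultdict(list)
    for attr_name, tokens in attlist.items():
        for token in tokens:
            owners[token].append(attr_name)
    return {tok: names for tok, names in owners.items() if len(names) > 1}

problem = {"orderable": ["yes", "no"], "comments": ["yes", "no"]}
fixed = {
    "orderable": ["orderable", "notorderable"],
    "comments": ["sendcomments", "nocomments"],
}
print(find_token_collisions(problem))
print(find_token_collisions(fixed))
```

The first attribute list reports both yes and no as shared tokens; the version with more information packed into each token reports none.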

8.3.2. Designing ID and ID Reference Attributes

It's common to use the ID declared value for attributes that contain an element's symbolic identifier, because validation by an SGML parser will ensure that all IDs within any one document are unique and that any attributes containing IDREF references to these IDs are valid. In addition, most SGML-aware applications are prepared to perform special processing on ID and IDREF values. Thus, the choice of ID for element identifiers is appropriate for most DTDs.

However, you might find the length or character requirements on SGML names (on which the ID declared value is based) to be too restrictive. If you're able to make your processing applications perform any of the validation and other work that would have come for free with ID/IDREF, you could use CDATA instead. In this case, attributes containing references to IDs should also have a declared value of CDATA.
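
The work that ID/IDREF validation would otherwise do for free can be sketched as follows. This Python example is a rough outline under several assumptions: the document has been normalized to well-formed XML, identifiers live in an attribute named id, and the element and attribute names in ALLOWED_TARGETS are hypothetical. The check on the type of the referenced element goes beyond what an SGML parser itself enforces.

```python
import xml.etree.ElementTree as ET

# Which element types each referencing attribute may point to
# (hypothetical names: a glossref's link should reach a glossentry).
ALLOWED_TARGETS = {("glossref", "link"): {"glossentry"}}

def check_references(xml_text, id_attr="id"):
    """Return a list of problems: duplicate IDs, dangling or mistyped references."""
    root = ET.fromstring(xml_text)
    problems = []
    by_id = {}
    for elem in root.iter():
        ident = elem.get(id_attr)
        if ident is None:
            continue
        if ident in by_id:
            problems.append(f"duplicate id: {ident}")
        else:
            by_id[ident] = elem.tag
    for elem in root.iter():
        for (tag, attr), targets in ALLOWED_TARGETS.items():
            if elem.tag != tag or elem.get(attr) is None:
                continue
            ref = elem.get(attr)
            if ref not in by_id:
                problems.append(f"dangling reference: {ref}")
            elif by_id[ref] not in targets:
                problems.append(
                    f"{ref} refers to {by_id[ref]}, expected {sorted(targets)}"
                )
    return problems

sample = (
    '<doc><glossentry id="g1">term</glossentry><para id="p1">text</para>'
    '<glossref link="g1">a</glossref><glossref link="p1">b</glossref>'
    '<glossref link="zz">c</glossref></doc>'
)
for problem in check_references(sample):
    print(problem)
```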

No matter how you declare your ID attributes, you may find that it's useful to allow IDs on every element type, so that you can manage, process, and store elements by their identifiers.

For each attribute that serves as a reference to a unique identifier (of whatever declared value), determine which element types' identifiers may be referenced from the attribute, and document the results in DTD comments or documentation. For example:

<!ELEMENT glossref - - (#PCDATA)>
<!ATTLIST glossref
        link    IDREF   #REQUIRED --to glossentry--
>

8.3.3. Designing Attributes with Implied Values

From the perspective of document creators, an attribute with a default value of #IMPLIED simply means an attribute value that is “optional to supply,” just as if an actual default value were available. But developers of processing applications need to know more:

  • Can the application proceed without any value at all for the attribute?

  • If not, how should the application derive the correct value?

When you implement attributes with #IMPLIED default values in a DTD, you should be able to answer these questions in DTD comments or documentation.

Following are some common answers.

  • An attribute might be truly optional, not needing to be supplied for the application to proceed.

    <!ATTLIST para
            id      ID      #IMPLIED --OK not to have value--
    >
    
  • It might need to be inherited from the nearest ancestor with a specified value.

    <!ELEMENT example          - - (title, computer-listing)>
    <!ATTLIST example
            audience        (novice|expert) novice
    >
    <!ELEMENT title            - - (#PCDATA)>
    <!ELEMENT computer-listing - - (#PCDATA)>
    <!ATTLIST computer-listing
            audience        (novice|expert) #IMPLIED
                                            --get from parent--
    >
    
  • The attribute value might depend on the element's context in the instance.

    <!ATTLIST numlist
            numstyle        (arabic|alpha)  #IMPLIED
                                            --occurs inside self?--
    >
    
  • The value might depend on characteristics of the environment in which the information was assembled, formatted, or retrieved.

    <!ATTLIST procedure
            audience        (novice|expert) #IMPLIED
                                            --value of USERTYPE
                                              environment variable?--
    >
    

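The inheritance case is the one most often left to application code. A minimal Python sketch of it follows, assuming the parsed document is available as an element tree in which an implied attribute is simply absent; the function name inherited_value is hypothetical, and a real SGML parser would also supply declared defaults, which this sketch does not model.

```python
import xml.etree.ElementTree as ET

def inherited_value(root, target, attr):
    """Return `attr` from `target` or from its nearest ancestor that specifies it."""
    # ElementTree has no parent pointers, so build a child-to-parent map first.
    parents = {child: parent for parent in root.iter() for child in parent}
    node = target
    while node is not None:
        value = node.get(attr)
        if value is not None:
            return value
        node = parents.get(node)
    return None

root = ET.fromstring(
    '<example audience="expert"><title>T</title>'
    '<computer-listing>ls -l</computer-listing></example>'
)
listing = root.find("computer-listing")
print(inherited_value(root, listing, "audience"))
```

If no ancestor specifies the attribute, the function returns None, which is the point at which the application needs one of the other answers listed above.
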
8.4. Useful Markup to Consider

Certain kinds of elements and attributes may be useful to add to any DTD. This section describes markup for “semantic extension” to extend the life of the DTD and markup to help in conversion of data.

8.4.1. Semantic Extension Markup

It's unlikely that a DTD development project can anticipate every semantic component needed in the data format. Unusual document samples can come to light after the environment has been developed and tested, and new needs can always arise. Further, the more precise and complex the markup model, the more likely that it will be unsuitable for data that is only slightly different. Therefore, it's a good idea to add markup to a DTD that can serve as “escape hatches” to capture data that would not have been marked up otherwise, as long as conventionally agreed-on uses of the escape hatches are well documented.

We call this notion semantic extension. You can use semantic extension markup to help you extend your markup model indefinitely, but more likely you will want to use it to inform the design of future versions of your DTD and help manage conversion of your documents to the new versions.

For example, you might have defined a highly structured model for light bulb jokes. It might have elements or attributes for the kind of person changing the light bulb, the number of people required to do the job, and so on. The more you make your element the embodiment of the light bulb joke model (as opposed to, say, the elephant joke model, or the model of jokes in general), the less possible it will be to use the element for encoding other jokes.

In this case, you may want to provide a corresponding general-purpose element for encoding the (possibly as yet unknown) other kinds of jokes that may not fit the existing molds, and allow data and markup in it that can be free-form to the degree necessary. For example, along with the lightbulb-joke element, you might create a general-joke element that simply contains one or more paragraphs. If it later becomes clear that the general element is being used frequently for a class of jokes (for example, knock-knock jokes) for which more SGML structure would be beneficial, a new element for that structure could be created and instances of the general element could be converted to instances of the new element.

General-purpose elements such as this are, in effect, abdicating some of the responsibility for identifying the element's type. Since you can't know beforehand what the precise type is, it's helpful to provide an attribute to allow (or even require) the author to fill it in per instance of the element. This attribute is sometimes called role. For example:

<general-joke role="knockknock">
<para><quote>Knock knock.</quote></para>
<para><quote>Who's there?</quote></para>
⋮
</general-joke>

Especially if your DTD will be an industry standard or will otherwise be widely used by many different audiences, you might want to consider putting the role attribute on every element, even the specialized ones. In this capacity, it provides a powerful semantic extension mechanism that guarantees compliance to the original element structure, attribute list, and general intention, while adding information about the element's “flavor” that individual organizations can use in their own environments. Using this attribute can even help decrease the pressure on organizations to extend the standard markup model materially for their own use.

For example, suppose a DTD for software documentation offers an element for general random-order lists, but not for specialized lists of restrictions on software functionality. If a department using the DTD wants to be more specific in marking up some of its lists, it can “borrow” all the characteristics of the original list but add one detail:

<!DOCTYPE randlist [
<!ELEMENT randlist  - - (item+)>
<!ATTLIST randlist
        role            NAME            #IMPLIED
>
<!ELEMENT item      - - (#PCDATA)>
]>
<randlist role="restrictions">   now functions as a restrictions list

<item>
Command line output redirection doesn't work.
</item>
<item>
File names cannot have more than three characters.
</item>
</randlist>

Armed with this extra knowledge about its lists, the department can format restriction lists differently from regular lists, keep a statistical database on software restrictions, and so on. At the least, it will be prepared for an expected future version of the DTD that will incorporate a new element for this purpose.
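
On the processing side, honoring a role value can be as simple as a lookup that falls back to the element's ordinary treatment. The Python sketch below is purely illustrative; the style names, the STYLES table, and the function name style_for are all hypothetical.

```python
# Hypothetical style table keyed by (element name, role value).
STYLES = {
    ("randlist", None): "bulleted",
    ("randlist", "restrictions"): "warning-box",
}

def style_for(name, role=None):
    """Pick a style for an element, falling back to the role-less default."""
    return STYLES.get((name, role), STYLES.get((name, None)))

print(style_for("randlist", "restrictions"))
print(style_for("randlist", "some-unrecognized-role"))
```

Because an unrecognized role falls back to the plain randlist style, documents that carry department-specific role values remain usable by applications that know nothing about them.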

Note that the role attribute here was defined to have a declared value of NAME so that its value would be forced to be a “keyword.” However, if you want to allow document creators to supply longer descriptions of the role, CDATA is more appropriate.

Authors often request a set of one or more elements for font control (bold and so on). These elements are usually inappropriate to add to the DTD. Supplying a role attribute value on existing elements can actually diminish the need for these elements, since sometimes authors merely want to mark up a new “flavor” of an existing data-level element. If the content of such elements needs to be formatted differently from the default, authors must negotiate with the application developer to make their new role attribute values be recognized by the formatting software. This requirement for negotiation helps you control the misuse of elements a little more closely.

If your DTD puts a set of common attributes on all its elements (for example, for revision status and security access), you may want to consider adding a general-purpose data-level element that can serve as an “attribute hanger” for otherwise unremarkable regions of text that are smaller than a paragraph. Such an element is sometimes called phrase.

If your markup model has both specific elements and general-purpose elements for semantic extension, you run the risk of Tag Abuse by authors, who might use the general-purpose versions where the specific version is warranted. Author training, markup review, and markup assistance (through structured editors, templates, forms interfaces, or generation of data and markup from other sources) can help lower this risk.

8.4.2. Markup That Eases Document Conversion

In converting legacy documents to SGML or transforming SGML documents to conform to a different DTD (or to the same DTD but with different or augmented contents), you might have a series of conversion DTDs that get the data closer and closer to the ultimate desired form. It may make sense to add special markup to DTDs that holds the results of conversion.

If the target DTD is less specialized than the source form, you may want to retain information about the source markup as you convert so that you can later convert in the other direction. To hold this information, you could put an attribute on every element in the DTD. Such an attribute is sometimes called remap.

<!ATTLIST elem
        remap   NAME    #IMPLIED>

For example, suppose the source form has a specialized element for lists of software restrictions (this is the reverse of the situation described in Section 8.4.1, “Semantic Extension Markup”). If you convert an instance of this element to a DTD that has no such specialized list but has a general-purpose list with the same structure, you can record the original semantic information in remap.

<!DOCTYPE randlist [
<!ELEMENT randlist  - - (item+)>
<!ATTLIST randlist
        remap           CDATA           #IMPLIED
>
<!ELEMENT item      - - (#PCDATA)>
]>
<randlist remap="restrictions">   used to be a restrictions list

<item>
Command line output redirection doesn't work.
</item>
<item>
File names cannot have more than three characters.
</item>
</randlist>

If the source and target forms are so different that no such simple one-to-one mapping is possible, or if you anticipate a conversion whose results will not be 100 percent correct, other markup may be more helpful. The conversion software might have occasion to make notes on the ongoing process, for example, noting that the results in one case or another are merely a guess at the correct element. It could output these notes as SGML comments.

<!-- *** CHECK: probably the wrong element choice -->
<specialterm>...</specialterm>

However, comments are discarded by SGML parsers and therefore can't be used in further processing. It would be more helpful for later processing stages or for the manual cleanup process to use a conversion-specific element and attributes that can store important conversion information.

<!DOCTYPE document [
<!ELEMENT document - - (...) +(conversion-note)>
<!ELEMENT conversion-note - - RCDATA>
<!ATTLIST conversion-note
        date            CDATA           #IMPLIED
        source          CDATA           #IMPLIED
        remap           CDATA           #IMPLIED
        problem         CDATA           #IMPLIED
        notes           CDATA           #IMPLIED
>
⋮
]>
⋮
<conversion-note
  date="11 Sep 1995 14:23.00"
  source="troff"
  problem="wrongelem"
  notes="Might be a commandname">
*****************************************
WRITER: The following specialterm may be
a commandname instead. You should check
the intent and change the markup as 
necessary.
*****************************************
</conversion-note>
<specialterm>...</specialterm>
⋮
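
A later processing stage could then gather these notes into a cleanup report. The following Python sketch shows one way this might be done, assuming the converted document is available as well-formed XML; the function name conversion_report and the shape of the report are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def conversion_report(xml_text):
    """List (problem, source, following element name) for each conversion-note."""
    root = ET.fromstring(xml_text)
    report = []
    for parent in root.iter():
        children = list(parent)
        for index, child in enumerate(children):
            if child.tag != "conversion-note":
                continue
            # The note is assumed to describe the element that follows it.
            has_next = index + 1 < len(children)
            following = children[index + 1].tag if has_next else None
            report.append((child.get("problem"), child.get("source"), following))
    return report

sample = (
    '<document><conversion-note source="troff" problem="wrongelem">'
    'WRITER: check this.</conversion-note>'
    '<specialterm>ls</specialterm></document>'
)
print(conversion_report(sample))
```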

If special elements are used to mark up data during conversion but are not actually added to your DTD, conversion personnel can use a validating parser to help them find and fix conversion problems.

8.5. Designing Markup Names

Application developers and authors know an element by its generic identifier—its name.[13] If your documents will be generated by a conversion or assembly process and won't ever need to be seen by human eyes as they are processed, you could probably get away with giving the markup unintuitive names; a paragraph element could be called x23 and no one would care. A similar situation might hold for structured editors with aliasing schemes and displays that completely hide the real element names from users.

However, the more likely case is that at some point authors, application developers, or other people will actually have to see and understand the element names and other markup names as they are declared in the DTD. Therefore, they should be as easy to read and understand as possible, given length constraints.

Note

This section makes suggestions about English markup names, although some advice may apply to other languages as well, particularly Western languages.

In naming elements, you need to respect the technical jargon that has been used by the design team. Especially if you're not a subject-matter expert yourself, it's important that you check with the team before choosing names that stray from the terminology used in the document analysis report.

For most environments, it's useful to give elements relatively long names so that similar elements are easily distinguished and people don't need to keep looking things up in the DTD documentation. However, there are some factors that will encourage you to keep names short. The project planning documents should help you determine your situation.

  • Markup Is Keyed Directly into Files

    If SGML documents will primarily be created through the use of unstructured editors to enter data and markup, limit the length of markup names to cut down on keyboarding (along with setting up the environment to offer shortcuts, templates, and forms for help with markup insertion). You'll see even bigger benefits from using a markup minimization strategy, if your environment supports minimization (discussed in Section 8.6, “Designing Markup Minimization”).

    One way to increase readability while keeping the typing to a minimum is to use longer names for infrequently used high-level elements and shorter names for data-level elements that occur frequently in text.

  • Names Must Adhere to Default NAMELEN

    Must you adhere to the reference NAMELEN quantity in the SGML declaration because of parser or other application constraints? The reference value for NAMELEN is 8, but few SGML-aware applications force you to adhere to this limit, and some widely used applications, including those for CALS, use a NAMELEN value of 32. By contrast, in environments where authors are accustomed to short, cryptic markup or where compatibility with all environments is a goal, they may insist on very short SGML markup names—eight, four, or even fewer characters.

  • Storage Space Is Limited

    If storage space for SGML document files is a concern, short markup names may be necessary. However, you could achieve greater savings with an aggressive markup minimization strategy, if your environment supports minimization (discussed in Section 8.6, “Designing Markup Minimization”).

  • Conversion Services Are Based on Character Counts

    If you are paying “by the byte” for data conversion services, you may want to keep your costs low by choosing the shortest markup names possible. If you then want to lengthen the names for editing purposes, you can perform a simple one-to-one mapping.

If it's necessary to abbreviate words in the markup names, take a moment to plan when and how to abbreviate. For example, will you eliminate vowels to get spscr for your superscript element (which doesn't help much and is cryptic), or will you simply shorten the word to super or even sup? It's sometimes difficult to identify consistent patterns of abbreviation, partly because languages themselves aren't consistent. Try to choose “natural” abbreviations that authors themselves might use if they were, say, taking notes during a meeting.

For long markup names that are hard to read, you might want to use certain characters to break up the names. The basic SGML declaration allows hyphens ( - ) and periods ( . ) in the second and subsequent character positions of markup names. These characters allow you to break up an element name such as emailaddress into email-address or email.address.
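As a minimal sketch, the following declarations show both separator styles in use. The element and attribute names are hypothetical, and the sketch assumes that NAMELEN has been raised above the reference value of 8, since names this long would otherwise be rejected.

```sgml
<!-- Hypothetical names; assumes NAMELEN has been raised above the
     reference value of 8 in the SGML declaration. -->
<!ELEMENT email-address   - - (#PCDATA)>
<!ELEMENT postal.address  - - (#PCDATA)>
<!ATTLIST email-address
          contact-type    (work|home)   #IMPLIED>
```

Whichever separator you choose, using the same one throughout the DTD keeps authors from having to remember which names take a hyphen and which take a period.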

You might find that you want to add a prefix to subelements of a highly structured element to serve a self-documentation function, indicating which element they belong in and helping authors find them in the alphabetized reference documentation for the DTD. Naturally, it's a good idea to keep the prefix consistent. For example, in your glossary elements, don't name the “term” subelement glterm while naming the “definition” subelement glosdefn.
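One way such a consistent prefix might look is sketched below. The gloss prefix and all four element names are illustrative choices rather than a recommendation, and names longer than eight characters again assume a raised NAMELEN.

```sgml
<!-- Hypothetical glossary declarations: every subelement carries
     the same "gloss" prefix as its parent. -->
<!ELEMENT glossary    - - (glossentry+)>
<!ELEMENT glossentry  - - (glossterm, glossdefn+)>
<!ELEMENT glossterm   - - (#PCDATA)>
<!ELEMENT glossdefn   - - (#PCDATA)>
```

With this pattern, all of the glossary elements sort together in alphabetized reference documentation.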

By default, markup names other than entity names are case insensitive. You can change this setting in the NAMING section of the SGML declaration (described in Section A.9, “SGML Declarations”) and add to the set of characters allowed to be used in markup names, though in general it's not a good idea to do so. In certain circumstances where legacy documents already contain markup that uses a larger set of characters, it may be necessary to extend the set of name characters, typically to add the underscore ( _ ) character. Note that not all SGML-aware software supports changing NAMING settings.
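For illustration, a NAMING section that adds the underscore as a name character might look like the following fragment. It is only a fragment of the SYNTAX portion of an SGML declaration, and it assumes that everything else is left as in the reference concrete syntax, where the LCNMCHAR and UCNMCHAR values are "-.".

```sgml
-- Fragment of the SYNTAX section of an SGML declaration, not a
   complete declaration. Only the underscore differs from the
   reference concrete syntax. --
NAMING   LCNMSTRT ""
         UCNMSTRT ""
         LCNMCHAR "-._"
         UCNMCHAR "-._"
         NAMECASE GENERAL YES
                  ENTITY  NO
```

The NAMECASE settings shown are the defaults: GENERAL YES makes element and attribute names case insensitive, while ENTITY NO keeps entity names case sensitive.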

One final suggestion: The SGML standard recommends that all attributes in a DTD with the declared value ID have the same name. By convention in the industry, that name is usually id, and attributes with the declared value IDREF are not named id. If you use a different convention, you should have a very good reason for it, and should document these attributes carefully.
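A short sketch of this convention follows. The element names and the linkend attribute name are hypothetical; the only point being illustrated is that every ID attribute is called id and the IDREF attribute is called something else.

```sgml
<!-- Hypothetical elements; "linkend" is an illustrative name for
     the IDREF attribute, not one mandated by the standard. -->
<!ATTLIST section   id       ID     #IMPLIED>
<!ATTLIST figure    id       ID     #IMPLIED>
<!ATTLIST xref      linkend  IDREF  #REQUIRED>
```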

8.6. Designing Markup Minimization

Markup minimization allows markup in a document to be represented by fewer characters than it would need in a fully normalized document. It can be thought of as a kind of markup compression. You might want to design a minimization strategy to help authors who must type in their own markup, or to save on disk space. Because most SGML-aware editors manage the insertion of markup without requiring authors to type it in, these editors don't make use of minimization when they write out SGML document files.

Here we'll briefly discuss the most commonly used types of minimization and the effect they can have on markup model design:

  • Short reference minimization

  • Short tag minimization

  • Omitted tag minimization

Short references allow regular data characters to be interpreted as markup, for example, using keyboarded quotation marks (") to stand for quote element start- and end-tags. This mechanism is very powerful in certain circumstances and can dramatically decrease the character count of markup in SGML document files, but can be tricky and time-consuming to implement. Short reference minimization is available to all SGML documents, but it may be meaningless to use short references in SGML-aware software environments that hide the actual markup from authors.
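The quotation-mark case might be set up roughly as follows. This is a sketch rather than a tested implementation: the element, entity, and map names are all hypothetical, and it assumes the reference concrete syntax, in which the quotation mark is one of the available short reference delimiters.

```sgml
<!-- Sketch only: element, entity, and map names are hypothetical. -->
<!ELEMENT para   - - (#PCDATA|quote)*>
<!ELEMENT quote  - - (#PCDATA)>

<!ENTITY  qopen   STARTTAG "quote">
<!ENTITY  qclose  ENDTAG   "quote">

<!-- Inside para, a quotation mark opens a quote element. -->
<!SHORTREF paramap   '"' qopen>
<!-- Inside quote, a quotation mark closes it. -->
<!SHORTREF quotemap  '"' qclose>

<!USEMAP paramap   para>
<!USEMAP quotemap  quote>
```

With these declarations, an author who types a paragraph containing "hello" in keyboarded quotation marks would have it parsed as if the word were enclosed in quote start- and end-tags.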

Short tag minimization allows selected pieces of individual tags to be left out. You enable it by setting the SGML declaration's SHORTTAG feature to YES. Once this feature is enabled, you have no control over the extent to which it is used in the document files, and some kinds of short tag minimization can make the files extremely hard to read.
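For reference, the relevant part of an SGML declaration might read as follows. This is only the FEATURES fragment, and the settings other than SHORTTAG are assumed to match the reference SGML declaration.

```sgml
-- Fragment of an SGML declaration, not a complete declaration.
   OMITTAG YES is shown only because the reference declaration
   also enables it. --
FEATURES
  MINIMIZE  DATATAG NO   OMITTAG  YES   RANK   NO   SHORTTAG YES
  LINK      SIMPLE  NO   IMPLICIT NO    EXPLICIT NO
  OTHER     CONCUR  NO   SUBDOC   NO    FORMAL  NO
```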

Following are the ways in which tags can be shortened, in order from most useful (and least distracting) to least useful (and most distracting).

  1. Attribute values in start-tags can do away with quotation marks if their content conforms to the rules for SGML NAME tokens.

    <document status=final>
    
  2. Attribute values that come from an enumerated list of allowed values can be supplied without their corresponding attribute names and equal signs ( = ).

    <document final>
    
  3. An empty pair of end-tag delimiters ( </> ) can be used to represent an end-tag for the most recently opened element.

    The <command>grep</> command...
    
  4. The closing delimiter of a start-tag and the entire end-tag can each be replaced with a single slash ( / ), a form known as the null end-tag.

    The <command/grep/ command...
    
  5. In certain circumstances, an empty pair of start-tag delimiters ( < > ) can be interpreted as the top-level document element, a start-tag repeating the previously opened element, or an end-tag.

  6. When two tags are next to each other, the ending delimiter ( > ) of the first one can be left out.

    <section<title>Title Text
    
    </para</section>
    

Omitted tag minimization allows certain start-tags and end-tags to be left out of document instances based on specifications provided in element type declarations. It is a less efficient but more easily managed method of minimization than either short references or short tags.

If it makes sense for your project to use omitted tag minimization, you should plan the pattern of omission you will assign to the elements. Often, it is confusing to read SGML document source files that have been heavily minimized; consistency can alleviate this problem. Following are some suggestions:

  • In general, avoid allowing start-tags to be minimized. The absence of clues as to the current element can be disconcerting to readers of the document files. Also, if attribute values must sometimes be specified for such an element, the pattern of markup for that element will inconsistently go back and forth because a start-tag must be present for attribute values to be supplied.

  • The document hierarchy elements can often have their end-tags omitted because they have distinct components that are incompatible with each other. A new chapter start-tag, for instance, implicitly closes the previous chapter, preface, or other similar division. For example:

    <!DOCTYPE doc [
    ⋮
    <!ELEMENT chapter - O (title, ...)>
    ⋮
    ]>
    <chapter><title>Chapter Title</title>
    chapter content
    <chapter><title>Another Chapter Title</title>
    chapter content
    

    This pattern of omission doesn't save many tags, but for some reason it is common in environments where omitted-tag minimization is heavily used.

  • The elements for information units tend to need an end-tag under all circumstances so that they can be closed unambiguously. For example, if a note element can contain paragraphs and can also appear at the same level as paragraphs, a /note end-tag will always be required. However, paragraphs and subelements for information units, such as items inside lists, can often have their end-tags omitted. For example:

    <!DOCTYPE doc [
    ⋮
    <!ELEMENT section  - - (title, (para|note)+)>
    <!ELEMENT title    - - (#PCDATA)>
    <!ELEMENT note     - O (title, para+)>
    <!ELEMENT para     - O (#PCDATA)>
    ⋮
    ]>
    <section><title>Section Title</title>
    <para>
    This is a paragraph. When the note comes along,
    this paragraph will be implicitly closed
    because paragraphs can't contain notes.
    <note><title>Note</title>
    <para>
    This is a paragraph inside the note. The
    only way to tell that the note is done is
    to use a note end-tag; another para start-tag
    would be interpreted as being inside the note.
    </note>
    
  • For data-level elements in the free flow of text, it can be convenient to allow end-tags to be omitted if you expect them to appear relatively often just before the end of a higher-level element. Typically, however, the presence of the data-level end-tag will be required.
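The last suggestion can be sketched as follows, under the assumption that OMITTAG is set to YES and that a hypothetical command element often ends a paragraph. In the first paragraph the para end-tag implicitly closes the command element; in the second, the command end-tag must be present because more text follows.

```sgml
<!DOCTYPE doc [
<!-- Hypothetical declarations; assumes OMITTAG YES. -->
<!ELEMENT doc      - - (para+)>
<!ELEMENT para     - - (#PCDATA|command)*>
<!ELEMENT command  - O (#PCDATA)>
]>
<doc>
<para>To search the log files, use <command>grep</para>
<para>The <command>grep</command> command also accepts patterns.</para>
</doc>
```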

8.7. Addressing Other Factors in Markup Design

In addition to the basic markup model expressed in the DTD, you may be responsible for other related features of the DTD and markup-related pieces of the “documentation engineering toolbox”:

  • Characters usually used as markup delimiters that are needed in document content

  • Entities for special symbols and characters

  • Text databases and templates

  • Default entity declaration

The following sections discuss these issues.

8.7.1. Allowing Markup Characters as Document Content

As with all markup systems, SGML has the problem of how to allow document instances to contain, as data content, characters that are normally interpreted as markup. The two main examples are the left angle bracket ( < ), which normally begins a tag, and the ampersand ( & ), which normally begins an entity reference. If a document contains one of these characters followed by a character that can legally begin a name, validating parsers will report an error if the string turns out not to be a legitimate piece of markup.

For instance, this example produces an error because <b looks like the beginning of a start-tag that is improperly closed.

If a<b then fill in the total amount here.

The following example, however, produces no error because <4 can't possibly be the beginning of a start-tag—a digit is not allowed to be the first character in an element name (unless you've changed the NAMING settings in the SGML declaration).

This game uses only the cards <4 in the deck, including aces.

If you don't provide any mechanisms for allowing these characters to appear as content, authors do have one method at their disposal for inserting the character: a CDATA marked section, which surrounds a region that should not be parsed for element or entity reference markup. For example:

If a<![ CDATA [<]]>b then fill in the total amount here.

CDATA marked sections can't contain a series of two right square brackets ( ]] ) as content, since this string is treated as the markup that closes the marked section.

The easiest general way to enable left angle brackets and ampersands as data content is to provide SDATA (“specific” character data) text entities corresponding to the characters, so that when the documents are processed after being parsed, the appropriate characters will be inserted in place of the entity references. For example:

<!-- for ampersand: -->
<!ENTITY amp SDATA "INSERT-AMPERSAND">
<!-- for left angle bracket ("less-than" symbol): -->
<!ENTITY lt SDATA "INSERT-LESSTHAN">

These entities might be used as follows.

The Mother&amp;Daughter Moving Company is having a sale!
We guarantee that your moving expenses will be &lt;five
hundred dollars.

SDATA entities, by definition, contain instructions that are specific to one system or processing application, and so they can present portability problems. However, the standard ISO entity set for numeric and special graphic symbols, discussed in Section 8.7.2, “Defining Entities for Special Symbols and Characters”, comes with ready-made entities for left angle brackets and ampersands that should work with most SGML systems. Thus, if you include this entity set in your DTD, the entities will be available to authors.
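Including that entity set might look like the following sketch. It assumes that your system has been set up to resolve the public identifier shown to a file containing the numeric and special graphic entity set; the parameter entity name ISOnum is a common convention rather than a requirement.

```sgml
<!-- Assumes your system maps this public identifier to a file
     containing the ISO numeric and special graphic entity set. -->
<!ENTITY % ISOnum PUBLIC
  "ISO 8879:1986//ENTITIES Numeric and Special Graphic//EN">
%ISOnum;
```

Once the set is included, authors can use the amp and lt entity references without any further declarations in your DTD.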

Note

If you use an SGML declaration that makes changes to the reference concrete syntax (the default assignments of characters to markup delimiter “roles”), for example, substituting a left square bracket ( [ ) for a left angle bracket, make sure that SDATA entities are available for your choices of markup characters.

For any DTDs that will be used for the actual documentation of how to use an SGML system, it is a common problem to need to “escape” examples of tags and document instance fragments. You may want to add element types to the markup model that take the self-referential nature of the information into account. For example, you might have a data-level element called start-tag that, when processed, outputs the necessary tag delimiters.

<!DOCTYPE dtd-manual [
⋮
<!ELEMENT start-tag  - - (#PCDATA)>
⋮
]>
<dtd-manual>
⋮
At the beginning of a trademarked term,
use <start-tag>trademk</start-tag>.
⋮
</dtd-manual>

8.7.2. Defining Entities for Special Symbols and Characters

If your documents need to contain symbols and characters that can't be entered directly into document files with keyboard key presses, the DTD needs to define SDATA (“specific” character data) text entities that represent the symbols so that applications can add the appropriate symbol during processing.

The ISO 8879 standard defines several sets of SDATA entities that can be used to produce many common symbols and characters. For example, the “ISOlat1” entity set defines the following symbol, among others:

<!ENTITY eacute SDATA "[eacute]"--=small e, acute accent-->

If this entity declaration is included as part of the DTD, document instances can refer to the entity. For example:

<para>
This dish makes a perfect entr&eacute;e on a
cold winter's night.
</para>

Commercial SGML-aware products capable of handling the ISO entities can output the correct symbol in place, with the correct point size, weight, and so on:

This dish makes a perfect entrée on a cold winter's night.

Many SGML formatting applications support the standard SDATA instructions for the ISO entities “out of the box,” and even include the entity sets and various predefined formal public identifiers with their products in order to simplify the use of the entities in any DTD. For example, the file containing the ISO entity set for publishing characters has probably been set up to map to the following public identifier:

ISO 8879:1986//ENTITIES Publishing//EN

If so, you can include the following parameter entity declaration and reference in your DTD to make these entities available to authors:

<!ENTITY % ISOpub PUBLIC "ISO 8879:1986//ENTITIES Publishing//EN">
%ISOpub;

Appendix D, ISO Character Entity Sets summarizes the available ISO entity sets and shows one possible formatted representation of the symbols and characters. The entity sets for Latin 1 characters, diacritical symbols, numeric and special graphic characters, and publishing characters are probably the most frequently needed for general publishing. If you want to use smaller portions of any of the sets, you can assemble only the entity declarations you need, as long as you include the copyright statement found in the original files. Remember to ensure that all your processing applications support the characters and symbols you make available in the DTD.

If you have an application that needs the instructions in these entity sets to be in a different form, copy the entity sets to new files, change the values for the entities, and make sure that the application is directed to the changed files instead of the original ones.

If the ISO entity sets or other available entity sets don't meet your needs, you may need to define your own SDATA entities. For example, if your documents need a character that looks like a happy face, you can define something like the following entity:

<!ENTITY happyface SDATA "INSERT-HAPPYFACE">

Your applications must then be made to operate appropriately on the SDATA value passed to them during parsing, such as inserting the symbol and ensuring it is of the appropriate point size. For example:

<para>
If the command returns a status of 0,
the outcome was successful.&happyface;

This reference to the entity might result in the following output.

If the command returns a status of 0, the outcome was successful.☺

For each application or system that needs a different system-specific SDATA instruction, you will need to provide a different definition for &happyface;. For example, if you need to prepare your documents for a simple character-cell display with no graphics capabilities, the processing for that system might use an alternate set of entity declarations that includes the following:

<!ENTITY happyface SDATA ":-)">

The use of this declaration in processing might result in the following appearance; in this case, the SDATA value is simply passed on to the output.

If the command returns a status of 0, the outcome was successful.:-)

If the character set being used for your documents happens to have a character position that corresponds to a happy face, you can make the SDATA instructions contain a special kind of entity reference called a “character reference,” which explicitly calls for the character that resides in a certain numbered position. Character references look like regular entity references, except that their opening delimiter consists of an ampersand followed by a number sign ( # ), and the entity “name” that is referenced must be a number. (Documents can also contain character references directly, without needing an SDATA entity to be defined.) For example, if the happy face is in position 99:

<!ENTITY happyface SDATA "&#99;">

Using character references makes your document files less portable because they depend on a particular character set rather than on processing that can change according to the circumstances.

8.7.3. Creating Text Databases and Templates

Many content-based markup models are meant to facilitate the creation of a text fragment database for assembling document content. For example, glossary entries are often treated this way, as are legal publication statements such as copyright notices and trademark attributions. If the project's utilization goals include this sort of text reuse and it is to be accomplished using SGML entities, you, the DTD implementor, will probably be responsible for creating and maintaining the entity declarations and ensuring they are included in the DTDs used by all documents that need them.
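A small text fragment database built from entities might be organized along the following lines. The file names, entity names, and replacement text are all hypothetical; the sketch shows one internal text entity and one external entity, collected in a shared file that each DTD pulls in through a parameter entity.

```sgml
<!-- Contents of a shared fragment file, hypothetically legal.ent -->
<!ENTITY copyrt   "Copyright by Example Corporation. All rights reserved.">
<!ENTITY tmnotice SYSTEM "tmnotice.sgm">

<!-- In each DTD (or internal subset) that needs the fragments -->
<!ENTITY % legal SYSTEM "legal.ent">
%legal;
```

Authors can then insert the shared text with references such as &copyrt; and any change made in the fragment file reaches every document that uses it.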

The project may also need templates containing data and markup, so that authors can work from a common base when writing new documents. This is especially true in cases where some editorial guidelines aren't able to be enforced through validation with a parser. You are likely to be responsible for developing and maintaining the templates and ensuring that they incorporate valid use of the DTD.

Section A.9.1, “Document Character Set” describes how an SGML declaration sets up the interpretation of numeric codes as characters.

8.7.4. Supplying a Default Entity Declaration

If a document refers to an entity that has not been defined, validating parsers will report an error. To avoid this error during validation but help authors find the problem in formatted text or attach special processing to these occurrences, you can provide a declaration in your DTD that will be used for any entity references whose entities haven't been declared explicitly. Supply #DEFAULT as the entity name.

<!ENTITY #DEFAULT "ENTITY NOT DEFINED!">
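As a brief illustration, the hypothetical document below refers to an entity named prodname that is never declared. With the default entity declaration in place, the reference resolves to the default text instead of causing a validation error.

```sgml
<!DOCTYPE doc [
<!-- Hypothetical document; "prodname" is deliberately undeclared. -->
<!ELEMENT doc   - - (para+)>
<!ELEMENT para  - - (#PCDATA)>
<!ENTITY #DEFAULT "ENTITY NOT DEFINED!">
]>
<doc>
<para>Welcome to &prodname;, the newest release.</para>
</doc>
```

In formatted output the sentence would contain the text ENTITY NOT DEFINED! where the product name was expected, which makes the missing declaration easy to spot.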


[10] Note that if you want to ensure that the CD isn't entirely empty, as discussed in Section 8.2.2, “Forcing the Occurrence of One of Several Optional Elements”, you need to use the following waterfall model:

<!ELEMENT songlist - - (((bombastic|danceable)+,
                        (ballad, (bombastic|danceable)*)?)
                       |(ballad, (bombastic|danceable)*))>

Here, either a ballad or one of the other two types must appear in the song list.

[11] In this case, the general-purpose element works like an “escape hatch” to account for needs not anticipated or quantified during DTD development. We call this concept semantic extension; we discuss it further in Section 8.4.1, “Semantic Extension Markup”.

[12] If the parser can get as far as line 9, another error will be reported for its RE. Normally, REs just after start-tags and just before end-tags are ignored because they can unfailingly be interpreted as being separator characters, and the RE on line 9 appears just before the listitem end-tag on line 10. In this case, however, because SGML parsers can't look ahead to determine conformance, the RE on line 9 will be flagged as misplaced #PCDATA before line 10 is ever reached.

[13] They might also know an element by the attribute that identifies the architectural form to which it conforms. See Section 10.2.4, “Making Markup Names Customizable” for more information.