Chapter 6. Modeling Considerations

Table of Contents

6.1. Distinctions Between Components
6.1.1. Multiple Elements
6.1.2. Single Element in Different Contexts
6.1.3. Single Element with Partitioned Content Models
6.1.4. Single Element with Multiple Attribute Values
6.2. Container Elements Versus Flat Structures
6.3. Documents as Databases
6.4. Strictness of Models
6.5. Divisions
6.6. Paragraphs
6.7. Generated Text
6.8. Augmented Text
6.9. Graphics

Every DTD development project is unique. However, many of the modeling problems you face will probably have much in common with the problems of other projects, particularly if the main goal of the project is to publish and deliver documents. This chapter presents some general advice about common modeling issues. (Chapter 8, Markup Model Design and Implementation, discusses complex modeling problems that usually need the help of a DTD implementor to uncover and resolve.)

6.1. Distinctions Between Components

The primary question in SGML modeling is, “How should we use elements and attributes to distinguish among kinds of information?” There is no single right answer. The following sections use the example of bulleted lists and numbered lists to explore the major possible choices and their likely effects on modeling and processing application implementation.

It is a common misconception that restricting the number of elements in a DTD means the DTD is less complex. For authors and application developers to understand and use a DTD properly, not only every element but every attribute value and every novel context for an element must be documented; in other words, what DTD users need to understand is the components, rather than just the elements. Therefore, if all the components represented in the model are necessary (which you should have established sufficiently by the time you've performed step 4), then the markup is necessary too, and keeping the number of elements to a minimum by using contexts and attributes creatively won't help. If the presence of many elements threatens to make authoring more difficult, the team can consider creating an alternate model for an authoring DTD (see Section 3.1.3.2, “Authoring DTDs”), or using methods of markup assistance such as forms interfaces, markup templates, or the customization of an SGML-aware editor, in order to help simplify the process.

6.1.1. Multiple Elements

The simplest and most common way to model components is to map each component to a unique element type. For example, numbered lists and bulleted lists might each have their own element types: number-list and bullet-list.

The basic currency of an SGML document is the element. All SGML processing applications can locate unique elements and process their content, so designing a model that relies on unique elements is a simple way to help ensure that the data can be processed in the desired ways. Also, every element type can have its own unique content model, attribute list, and allowable contexts, which is a level of flexibility you might want in your modeling efforts. However, if you feel that the number of elements has become overwhelming for users of the DTD, or that not enough distinction exists among the components to justify separate elements, you can use one of the other choices.
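As a rough illustration, the two list components might be declared along the following lines. This is only a sketch: the item element and the content models are assumptions made for the example, and names longer than eight characters assume an SGML declaration that raises NAMELEN above the reference concrete syntax's limit.

```sgml
<!-- One element type per component (sketch; item is a hypothetical name) -->
<!ELEMENT number-list - - (item+)>
<!ELEMENT bullet-list - - (item+)>
<!ELEMENT item        - - (#PCDATA)>
```

Because each list has its own declaration, either one could later be given a different content model or attribute list without affecting the other.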

6.1.2. Single Element in Different Contexts

Another way to model components is to map multiple components to a single element type that occurs in various contexts. For example, numbered lists might be identified by a list element when it is inside a procedure element, and bulleted lists might be identified by a list element in all other contexts.
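A minimal sketch of this choice might look like the following fragment. The section, title, para, and item elements are assumptions introduced for the example, and the name procedure assumes a NAMELEN larger than the reference concrete syntax's eight characters.

```sgml
<!-- A single list element type, allowed in more than one context -->
<!ELEMENT procedure - - (title, (para | list)+)>
<!ELEMENT section   - - (title, (para | list)+)>
<!ELEMENT list      - - (item+)>
<!ELEMENT (title | para | item) - - (#PCDATA)>
```

Nothing in these declarations says that a list inside a procedure is numbered. That distinction lives entirely in the documented processing expectations, which applications must implement by checking the parent of each list.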

A common way to use this modeling choice is to create a single title element, and use its occurrence inside various division elements, such as chapter, appendix, and section, to identify each unique titling component. This is a good modeling choice as long as the markup characteristics of the single element type are suitable for all contexts in which the element will be used, and as long as the various kinds of information can be considered as logically “the same.” If they seem fundamentally different, separate elements may be more appropriate, even if the elements are otherwise identical. For example, a division title and a bibliographic citation to the title of a different document are not logically the same thing, even though you might want to call them both “titles” and might design the same content model for them. Thus, two different elements would be appropriate.

If the context requirements result in processing applications having to perform an especially complex query on the SGML documents (for example, “if element X is the third from the last child inside element Y, and Y is anywhere inside Z but not directly inside W”), it is likely that there are multiple logical distinctions that can be made, which may result in multiple elements rather than just one. At the least, the DTD implementor or application developers may request model changes or suggest simplifications.

6.1.3. Single Element with Partitioned Content Models

A somewhat complex way of modeling components is to map multiple components to a single element type, relying on the use of a different portion of the available content model in each instance. For example, numbered and bulleted lists might both be represented by a list element, but instances of one would contain a series of number-item elements and instances of the other would contain a series of bullet-item elements. The content model would make the item choices be mutually exclusive.
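One way the mutually exclusive item choices might be written is sketched below. The #PCDATA content of the items is an assumption, and the item names assume a NAMELEN larger than eight characters.

```sgml
<!-- Partitioned content model: a list holds one kind of item or the other -->
<!ELEMENT list - - (number-item+ | bullet-item+)>
<!ELEMENT (number-item | bullet-item) - - (#PCDATA)>
```

A validating parser would reject a list that mixes the two item types, but an application can learn which kind of list it has only by looking inside the element.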

The most common case in which this modeling choice is used is to allow an optional title inside another element (such as a figure). If the title is present, the figure is formatted in one way, and if the title is absent, the figure is formatted in another way. This choice usually does not have a very strong rationale because the content-model alternative can't be controlled by the DTD, and authors could choose the “wrong” alternative in inappropriate circumstances. Further, many applications find it difficult to apply processing to elements based on characteristics of their internal contents (as opposed to characteristics of the outer context in which they appear).

6.1.4. Single Element with Multiple Attribute Values

You might choose to model components by mapping multiple components to attribute values on an element type. For example, numbered and bulleted lists might both be represented by a list element that contains a series of item elements, but a type attribute value of number would identify numbered lists and a value of bullet would identify bulleted lists.
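The attribute-based choice could be sketched as the small document below. Making the type attribute required, rather than giving it a default value, is an assumption of this example.

```sgml
<!DOCTYPE list [
<!-- One element type; the type attribute carries the distinction -->
<!ELEMENT list - - (item+)>
<!ELEMENT item - - (#PCDATA)>
<!ATTLIST list type (number | bullet) #REQUIRED>
]>
<list type="number">
<item>Remove the cover.</item>
<item>Disconnect the cable.</item>
</list>
```

The declared value group limits type to the two listed tokens, so a parser can catch misspelled values.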

It was once the case that many processing applications couldn't handle attribute values very well, but tools now generally support attributes. Therefore, this is a legitimate modeling choice, particularly if the attribute value represents a “flavor” of the entire element rather than corresponding to some fragment of its content model (as described in Section 6.1.3, “Single Element with Partitioned Content Models”). For instance, a trademark element for trademarked terms might have an attribute value indicating whether the term is a registered or unregistered mark. However, if the attribute value chosen has a dramatic effect on the meaning of the element or on its anticipated processing, you should consider whether the components would be better represented with multiple elements, even if their markup characteristics are identical.

Note

You cannot constrain a content model through attributes, nor can you constrain one attribute by means of another attribute. If you want to apply different attributes to different configurations of content and you want validating parsers to be able to enforce the constraints to the highest level possible, you need a separate element type for each configuration.

If presentation or other processing information must be supplied in an SGML document, try to store the information in attributes if possible.

6.2. Container Elements Versus Flat Structures

When is it a good idea to realize a “grouping” component as an element? There are two main considerations: the complexity of the group's contents and the sophistication of the desired processing. These factors are somewhat related because highly complex models are usually the result of a strong motivation to process the data creatively.

For example, the complexity of the group shown below, indicated by the arrow, suggests that a container element for the whole group would be a logical addition.

Many people consider container elements to be nuisances because they can easily swell the overall number of elements in a DTD. If the presence of such elements is justifiable but they are found to get in the way of authoring, various solutions are possible. First, in non-SGML-aware authoring environments, omitted-tag minimization can obviate the need to provide most container elements. Second, authoring DTDs can remove the container elements entirely and transformation engines can add them back when they are needed.

However, container elements can be an asset rather than a liability. Typically their presence is required, so SGML-aware authoring tools can insert them automatically and can make it easy to manipulate and display whole blocks of information. Also, as already mentioned, processing applications can use container elements to their advantage in associating various kinds of processing with the beginnings and ends of the blocks.

For example, a definition list model usually has one of two basic structures:

Even if all you want to do is format the definition list information in a simple way that doesn't require the container element, its presence can make it easy to count how many entries have been written and to rearrange the order of the entries.
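The flat and container forms of such a model might be declared roughly as follows. All of the names here are hypothetical (the two list types are given different names only so that both can appear in one fragment), and the longer names assume a raised NAMELEN.

```sgml
<!-- Flat structure: terms and definitions alternate directly in the list -->
<!ELEMENT flat-def-list - - (term, def)+>

<!-- Container structure: each term/definition pair is wrapped in an entry -->
<!ELEMENT def-list  - - (def-entry+)>
<!ELEMENT def-entry - - (term, def)>

<!ELEMENT (term | def) - - (#PCDATA)>
```

With the container form, counting entries means counting def-entry elements, and reordering means moving whole def-entry subtrees.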

Finally, consider whether a group of elements might need to have an ID or any other attribute values attached to it. If so, it may be necessary to add a wrapper element on which to put the attributes.

6.3. Documents as Databases

For information that lends itself to management in a database or to database-like treatment, you can open up many possibilities for sophisticated processing by designing relatively flat models representing “records.”

For example, in Chapter 4, Document Type Needs Analysis, we used the example of restaurant take-out menus to demonstrate the process of uncovering and interpreting structure. We mentioned two possibilities for organizing groups of related dishes on a menu: structural section elements somehow labeled with the type of dish versus content-based elements specifically to gather the soups, the steamed dishes, and so on. A third possibility would be the interpretation of each dish as a “record,” one of whose characteristics is its type. No explicit grouping is done in the SGML document at all; rather, a database of dishes is built up, and the relevant dishes can be extracted based on their types as needed. The goals for the CookThis project described in Chapter 4, Document Type Needs Analysis, suggest exactly this kind of handling.

If you had not planned to develop or acquire the processing that such a model requires, you must weigh the motivation for the extra processing against its cost. Following are examples of the consequences of choosing one type of model over another.

A traditional glossary entry model might look like this:

This structure allows for both providing a definition and providing a cross-reference to a definition, for example in cases where an acronym is listed in the glossary solely in order to help readers find the expanded version of the term. Both kinds of entries need to be put into alphabetical order for traditional paper-based publishing, either by an author or by an application that does assembly and sorting.

<gloss-entry id="tas-def">
<gloss-term>Tag Abuse Syndrome</gloss-term>
<gloss-def>
A condition that afflicts authors who choose
inappropriate markup to get a certain formatting
effect or choose markup that isn't as precise
or accurate as possible. A poor DTD design often
exacerbates the problem.
</gloss-def>
</gloss-entry>
⋮
<gloss-entry>
<gloss-term>TAS</gloss-term>
<gloss-see gref="tas-def">
</gloss-entry>

Tag Abuse Syndrome

A condition that afflicts authors who choose inappropriate markup to get a certain formatting effect or choose markup that isn't as precise or accurate as possible. A poor DTD design often exacerbates the problem.

TAS

See Tag Abuse Syndrome.

The drawback to the traditional model is that it requires the phantom entries and their links to the real entries to be maintained by authors, as if the phantoms were real entries themselves. Its benefit is that an application that creates and sorts the phantom entries doesn't have to be developed.
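Declarations consistent with the traditional instance shown above might look like the following sketch. The #PCDATA content models and the choice of #IMPLIED and #REQUIRED are assumptions, and the element names assume a NAMELEN larger than eight characters.

```sgml
<!-- Traditional model: an entry holds a definition or a pointer to one -->
<!ELEMENT gloss-entry - - (gloss-term, (gloss-def | gloss-see))>
<!ATTLIST gloss-entry id   ID    #IMPLIED>
<!ELEMENT (gloss-term | gloss-def) - - (#PCDATA)>
<!ELEMENT gloss-see   - O EMPTY>
<!ATTLIST gloss-see   gref IDREF #REQUIRED>
```

Declaring gref as IDREF lets a validating parser confirm that each phantom entry points at an ID that exists in the document, although it cannot confirm that the target is the right entry.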

A more sophisticated database-like model might look like this instead:

In this case, every entry in the original SGML document represents a single glossary entry “record,” and an application must generate “phantom” entries for the alternate terms and alphabetize as necessary.

<gloss-entry id="tas-def">
<gloss-term>Tag Abuse Syndrome</gloss-term>
<gloss-term-alt>TAS</gloss-term-alt>
<gloss-def>
A condition that afflicts authors who choose
inappropriate markup to get a certain formatting
effect or choose markup that isn't as precise
or accurate as possible. A poor DTD design often
exacerbates the problem.
</gloss-def>
</gloss-entry>

Tag Abuse Syndrome

A condition that afflicts authors who choose inappropriate markup to get a certain formatting effect or choose markup that isn't as precise or accurate as possible. A poor DTD design often exacerbates the problem.

TAS

See Tag Abuse Syndrome.

The database-like model does require an extra application to be developed, but such an application can help greatly with consistency and ease of document maintenance.
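A sketch of declarations for the database-like instance above follows. Allowing any number of alternate terms is an assumption of this example, as are the #PCDATA content models; the names assume a raised NAMELEN.

```sgml
<!-- Database-like model: one record per term/definition set -->
<!ELEMENT gloss-entry - - (gloss-term, gloss-term-alt*, gloss-def)>
<!ATTLIST gloss-entry id ID #IMPLIED>
<!ELEMENT (gloss-term | gloss-term-alt | gloss-def) - - (#PCDATA)>
```

There is no gloss-see element in this version; the "See" entries exist only in the output that the application generates.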

Intuitively, it seems ideal to store only as many entries as there are term/definition sets, which would suggest a choice of the database-like model. However, constraints on application development or conversion processes may suggest that a nearer-term solution using the traditional model is best for the moment. In the absence of application issues, however, the more sophisticated model is usually better, if there is enough motivation to actually implement the clever utilization ideas that the team and the developers come up with.

Make sure you explicitly state all your processing expectations for complex models. If the database-like model had elements for a term and its “abbreviation,” rather than a “primary term” and “alternative lookup terms,” and if the processing expectation held that the primary term is always the preferred lookup term, authors couldn't choose to list an abbreviation as the preferred term (for example, with the definition under TAS instead of Tag Abuse Syndrome).

Let's look at a second example: bibliography entries. A very simple example of a traditional model might look like this:

A database-like model might instead look like this:

The traditional model makes it easy for authors to get their desired bibliographic format by providing their own punctuation, and it avoids having to make an application insert the necessary symbols.

<biblio-entry>
<author>Digregorio, Charlotte</author>,
<title>Your Original Personal Ad</title>:
<publisher>Civetta Press</publisher>;
<date>1995</date>.
</biblio-entry>

Digregorio, Charlotte, Your Original Personal Ad: Civetta Press; 1995.

However, what if the approved bibliographic format used by the organization were to change? The old bibliographic format has been “locked in” to the files, and typographical errors and missing fields can't easily be caught.
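The traditional bibliography model, in which authors type their own punctuation between fields, implies mixed content along the lines of this sketch. The exact model is an assumption, and the names assume a NAMELEN larger than eight characters.

```sgml
<!-- Traditional model: fields and author-supplied punctuation intermixed -->
<!ELEMENT biblio-entry - - (#PCDATA | author | title | publisher | date)*>
<!ELEMENT (author | title | publisher | date) - - (#PCDATA)>
```

Because every field is optional and repeatable in a model like this, a parser cannot report a missing publisher or a duplicated date.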

The database-like model can make all the entries consistent and change the bibliographic format through application wizardry. Further, with the database-like model, entries are much more likely to have all the pieces of information identified properly and consistently, so that the entries can be searched on as if they were records in a simple relational database.

<biblio-entry>
<author>Digregorio, Charlotte</author>
<title>Your Original Personal Ad</title>
<publisher>Civetta Press</publisher>
<date>1995</date>
</biblio-entry>

Digregorio, Charlotte, Your Original Personal Ad, Civetta Press, 1995.

Note that if your model requires every field to be marked up but leaves the element order variable, the advantages of the database-like approach will be undermined.
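The database-like bibliography model might be declared as in the following sketch, which assumes that all four fields are required and that names may exceed eight characters.

```sgml
<!-- Database-like model: every field required, in one fixed order -->
<!ELEMENT biblio-entry - - (author, title, publisher, date)>
<!ELEMENT (author | title | publisher | date) - - (#PCDATA)>
```

Replacing the commas with the & connector, as in (author & title & publisher & date), would still require every field but would leave the order variable, which is the situation the preceding note warns about.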

6.4. Strictness of Models

It's a good strategy to start with a model that is relatively tightly circumscribed and expand it as necessary during testing. This way, the later changes will be compatible with any converted or newly written documents, and authors won't have a chance to use dubious markup features before testing has determined whether the features are appropriate. However, DTDs don't have the ability to deprecate or recommend models, only to allow or disallow them. Therefore, if a particular configuration of markup is needed, even if it is appropriate only rarely, the model must allow it.

Of course, even the most prescriptive DTD isn't able to enforce every single rule that you may want your documents to adhere to. For example, a DTD can't ensure that an element that's supposed to contain a string of data characters actually contains any, because the various keywords that represent data characters (such as #PCDATA) stand for “zero or more characters.” And you can't write a content model that ensures that all your “cautions” start with imperative verbs. In these cases, you will need to rely on additional validation applications or human checking to make sure the documents meet all requirements.

DTDs are often used to encode editorial and stylistic guidelines, so that companies can keep authors from violating the guidelines simply by requiring that documents be validated before they are accepted for publication. This is often one of the main motivations for migration to SGML in the first place. The question you will need to ask in your modeling effort is: When should a guideline be a DTD rule? For example, you might need to consider whether your sections should be required to contain at least one paragraph or other information unit.
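For instance, the guideline about sections could be turned into a DTD rule as in this sketch, in which the element names and the overall shape of the model are assumptions.

```sgml
<!-- Guideline as rule: a section needs at least one paragraph or list -->
<!ELEMENT section - - (title, (para | list)+, section*)>
<!ELEMENT (title | para) - - (#PCDATA)>
<!ELEMENT list - - (para+)>
```

Changing the + to * would give the looser model. Note that even the strict version accepts an empty para element, because #PCDATA matches zero characters.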

If by allowing a looser model the DTD could produce nonsensical document content, or could produce document content that causes problems for processing, then the model should be restricted. If a restriction would not cause actual problems, there are other factors you can examine to help you make the decision.

  • Realism of the Restriction

    If the documents in the project's scope have a problematic structure but must remain that way, the reference DTD needs to accommodate them. For example, your company might be planning to convert legacy documents that were written before the time when editorial rules had been established (or at least before the time that sufficient validation was being done). However, if you are able to upgrade the quality of documents being newly written, the authoring DTD can encode the desired restrictions in order to help authors conform to the rules.

    Beware of prescribing strict content models for a whole company or industry just because a small group of people—the design team—thinks the existing documents are “flawed.” Often you must ignore real legacy and sample documents in making such restrictions. For example, if some documents really do have divisions that split immediately into subdivisions without containing any explanatory paragraphs, a rule that requires divisions to have paragraph content is unrealistic.

  • Enforceability of the Restriction

    If authors can subvert the intention of the rule by committing Tag Abuse (for example, inserting empty paragraph elements in order to avoid validation errors), then you'll need to use additional applications or human checking anyway, despite the fact that the DTD “enforces” the rule. Once the additional validation is in place, you may want to remove the enforcement of the rule from the DTD to reduce the annoyance factor for authors.

    Some rules impose restrictions that authors would tend not to violate anyway, except through making an honest mistake. For example, where a title is required, authors will tend not to insert an empty title element to get around the rule. In these cases, the DTD will probably enforce the rule sufficiently, and any usual editorial reviews will catch the remaining problems.

  • Benefits of the Restriction

    Authors usually feel that restrictive markup rules are an imposition on the creative process. However, particularly in SGML-aware environments that offer a palette of markup choices to the user in each context, restrictions can be a real benefit. It does the author no good to be presented with a dozen choices of element when only two or three are really appropriate; in fact, such inefficiencies in the authoring process can also raise the costs of training, support, and copyediting. Even if the reference DTD must be relatively loose for some reason, it may be helpful to create a tighter authoring DTD.

6.5. Divisions

In modeling the document hierarchy, you will probably need to deal with divisions. Most large documents use division elements to collect their information into topical bundles. For example, novels have chapters and technical reference manuals often have “reference modules.” Your choices for division structure and nesting need to be sensitive to both your processing expectations and the authors' writing methodology, which each exert an influence on the other.

Chapters provide a good example of the importance of expectations on writing style. Chapters in a technical manual might very well be read out of sequence, and perhaps even accessed in a hypertext environment independently of the rest of the manual. By contrast, chapters in the average novel would be very unlikely to make sense if read out of order. Technical writers will tend to make the content of individual chapters relatively independent of information residing elsewhere (except in cases where an explicit cross-reference can be added), while novelists will be free to use transitional prose at the beginnings and ends of chapters. In these two cases, even though the chapter structures are similar, there is no guarantee that the information's functional roles in the documents are similar.

Technical information tends to be organized into multiple levels of division because of its complexity. Chapters or their equivalents might allow for subdivision several levels deep, or in some cases might even “skip” levels. Multiple levels of division pose additional modeling challenges.

Following is a checklist of issues to consider when modeling divisions.

  • Navigation by Table of Contents

    The natural hierarchy present in a document is usually an effective place for readers to start when they begin to navigate it. Most document-browsing software presents a hypertext table of contents that shows the division titles at the various levels, and it might also allow for other presentations based on the hierarchy, such as dynamic collapsing and expanding of the divisions. Many other SGML-aware processing applications also expect to find a coherent hierarchy to work from.

    If your model allows some levels of division to be skipped or to be used in inside-out order, you may not be able to take advantage of the hierarchy present in your documents when you use these applications.

  • Relationships with Upper, Lower, and Sibling Divisions

    Even though successive levels of containment are useful in organizing content, a hierarchical organization can obscure other useful relationships between divisions. The relationships mainly have to do with the type and degree of dependency that a division has on other divisions above, below, and beside it.

    Does a lower-level division need the information provided at higher levels to make sense? The expectations for the management and retrieval of dependent and independent divisions will most likely be different; in fact, it's most likely that dependent divisions will always be stored along with their ancestral independent container.

    If the reader accesses a dependent subdivision separately, what will be missing? How will the reader be informed what the missing pieces are? Most browsing applications can automatically offer readers a way to move “up in the tree” to fill in gaps in their knowledge, a useful feature for dependent divisions. For independent divisions, authors may need to hand-craft cross-references to related subjects.

    Do the divisions at any one level need to be read in order? It's usually easy to tell whether content-based division markup has a significant order, but hard to tell about structural markup; the safest assumption is that the author's supplied order is significant. For example, content-based encyclopedia entries are in alphabetical order merely for convenient lookup, but the sections in a technical manual may or may not be organized in random order. If you plan to take an especially sophisticated approach to information management and navigation, you may want your markup model to reflect different kinds of division that are “randomly ordered” versus “sequentially ordered,” or even to reflect divisions that are “optional” versus “required” for understanding of the subject.

  • Division Depth

    The maximum depth of nested divisions is a contentious topic. The guardians of editorial guidelines usually prefer a relatively flat structure for ease of reading and navigation, but in the new world of shared content and automatically assembled documents, authors often prefer an arbitrary number of levels so that they can freely promote and demote information in the hierarchy when they reuse it in new locations. The markup model will have to reflect the needs for reuse and transplanting of marked-up information.

    Many DTDs have a structural division model that uses explicitly numbered divisions to limit the depth of nesting: section1, section2, and so on. Since this markup model would impede efforts to reuse or transplant a section at one level to a different level, a different model might be in order. Often, the simplest solution seems to be to allow divisions to contain nested versions of themselves, to an arbitrary depth (that is, recursive divisions). This would allow any “subtree” of information to be transplanted anywhere in any document's hierarchy. As already mentioned, however, even if the markup model is suitable for transplanting, the prose may not be. The lower the level of division used when the information was first written, the more likely that the division has a highly dependent role and can't be reused in other divisions at the same level, much less be reused at a different level.

    (Note that some SGML-aware authoring environments handle automatic promotion and demotion of divisions by actually changing the markup as necessary. Thus, it may not be necessary to cater to “transplantation” concerns with an authoring DTD, but with software customization.)

    If shared content and heavy reuse are goals for your project, instead consider identifying islands of reusable content—content-based modules that are designed to be highly independent. You can allow them anywhere in your document hierarchy that you wish, and give them their own internal hierarchy that will always be required to travel with their main division. The internal hierarchy can be allowed to nest as deeply as needed, though usually it is kept relatively flat (no more than two to three internal divisions) for editorial reasons. Recipes, encyclopedia entries, and UNIX™ man pages are perfectly suited for this treatment.

    If content-based markup doesn't make sense for your divisions, it may still be helpful to identify a structural module element that can serve the same function. In fact, several popular writing methodologies, such as Information Mapping™, define such modular units.
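The numbered and recursive division models discussed under Division Depth can be contrasted in a short sketch. The title and para elements and the three-level limit are assumptions made for the example.

```sgml
<!-- Numbered divisions: the declarations fix the maximum depth -->
<!ELEMENT section1 - - (title, para*, section2*)>
<!ELEMENT section2 - - (title, para*, section3*)>
<!ELEMENT section3 - - (title, para*)>

<!-- Recursive divisions: a section may contain sections to any depth -->
<!ELEMENT section  - - (title, para*, section*)>

<!ELEMENT (title | para) - - (#PCDATA)>
```

Moving a section2 subtree to the top level requires renaming its elements, whereas a recursive section subtree can be moved without any markup changes.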

6.6. Paragraphs

In the traditional word processing world, a paragraph is well understood: It is a freely wrapped block of characters, usually arranged into sentences, that uses vertical space and indenting for separation from other paragraphs. This meaning is enshrined in the notion of “paragraph styles,” which are collections of procedural instructions relating to the vertical space, indenting, and margins of such blocks. We can call these “simple paragraphs.” An abstraction of their model would be something like this:
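One way to write such a model as a declaration, assuming a hypothetical emph element to stand in for whatever inline elements the DTD provides, is:

```sgml
<!-- Simple paragraph: character data and inline elements only -->
<!ELEMENT para - - (#PCDATA | emph)*>
<!ELEMENT emph - - (#PCDATA)>
```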

In the world of hierarchical containment, however, a less presentational and more structural definition of a paragraph might apply, perhaps something like the following: It is a very small (perhaps “atomic”) subdivision of a document that addresses a single topic and is presented to readers in such a way as to indicate its unity. Of course, there is still no question that paragraphs must somehow be formatted with vertical space, line breaks, or some equivalent. However, this definition would encompass the entirety of the following prose:

Before disassembling the motor, gather the following equipment:

  • A socket wrench (part number 1750–A)

  • A piece of string (part number 1750–B)

These parts will be essential to the disassembly and reassembly processes.

It is clear that without the list, the first sentence is incomplete, and that without the second sentence, the point introduced by the first sentence remains unfinished. A container element for both the character data and the list might be called a “complex paragraph,” and an abstraction of its model might look like this:
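As a sketch, the equipment example could be marked up as a single complex paragraph in the small document below. The element names and the list model are assumptions.

```sgml
<!DOCTYPE para [
<!-- Complex paragraph: character data interleaved with block elements -->
<!ELEMENT para - - (#PCDATA | list)*>
<!ELEMENT list - - (item+)>
<!ELEMENT item - - (#PCDATA)>
]>
<para>Before disassembling the motor, gather the following equipment:
<list>
<item>A socket wrench (part number 1750-A)</item>
<item>A piece of string (part number 1750-B)</item>
</list>
These parts will be essential to the disassembly and reassembly
processes.</para>
```

Both sentences and the list sit inside one para element, so they move together when the paragraph is reordered.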

It is useful to store the entire unit in a single paragraph container for reasons of authoring and formatting convenience.

First, the container allows the content to travel together without leaving stragglers behind when being reorganized. It is quite common for authors to write a set of paragraphs in a section, and then change the order several times to get the “flow” just right. Having the list travel automatically with the two sentences can be immensely useful in an SGML-aware authoring environment.

Second, the container can help the formatting process, precisely because it represents a logical grouping rather than a presentation-based one. For example, if you plan to use paragraph formatting with an indent on the first line, the container element can also do away with the need to have a “continued paragraph” element or an attribute on paragraphs to indicate when they shouldn't indent: Any block of characters after the first one is obviously a continuation of the same paragraph, and can be formatted as such with no other intervention by an author.

Many existing DTDs use complex paragraphs, with some even making “one or more paragraphs” the entire main content of divisions. Even if the concept seems odd at first, it may be worth your consideration.

At this point, you might object that perhaps this new container is not a paragraph at all, but rather a larger “chunk” or “nugget” of some kind, or that perhaps a multiple-paragraph container is needed in addition to hold entire “topics” or “threads” within a division. This idea raises interesting questions. What is the size of granule that users need to retrieve from documents? Below which size is it impractical to retrieve information without also providing its surrounding context? How can the benefits and costs of authoring with multiple “paragraph levels” be balanced? (Section 6.5, “Divisions” discusses some of these issues.)

Often, simple paragraphs, complex paragraphs, and threads don't make ideal retrieval objects; they are all too dependent on the context provided by the rest of the division (or other significant container) in which they reside. In these cases, the value of making such fine distinctions may be minimal.

6.7. Generated Text

Generated text (that is, strings of characters interwoven through the output of a document by a formatting application) plays a large role in most stylesheets developed for the formatting of SGML documents. For example, list items might need to have bullets or consecutive numbers inserted, and page numbers might need to be output somewhere on each page. Generated text can help achieve stylistic and editorial consistency, but it can also pose some modeling challenges.

If certain text strings will be required in the output (for example, because of corporate style policy), it's usually best to generate them and not to have an element that represents the generated text. For example, if all note elements must have a title of “NOTE”, there is no point in allowing or requiring a title element on notes. A stylesheet will do a much better job of inserting the right title text than authors will, and the stylesheet can ensure that the title is always spelled and capitalized correctly (and that the correct language is used, if the document is translated or maintained in different languages).

For the application developers to know they are expected to build this behavior into their formatting applications, you need to document your processing expectations; in this case, part of the basic “meaning” of the note element would be that it has a title, and that title always reads “NOTE.”

If instead of requiring a “NOTE” title, the corporate policy allows individual authors to override a default value, you have the following basic modeling options:

  • Require a title element on notes and use other means, such as providing markup templates or customizing the text editor, to instruct authors to prefer the text “NOTE” in the title content.

    This option removes any special title-processing expectations from the note element; it just has a regular title, like many other elements.

  • Make the title element on notes optional; when it is not supplied, use “NOTE”, but when it is supplied, use its content as the title.

    This option has processing expectations that are more complex than they might seem. When the title is absent, the note is nonetheless still a “titled” element because a title will be supplied on output; it's just that the SGML document instance doesn't explicitly represent this fact about the note. Any processing actions on titled elements will need to be applied to notes without title elements, too.

    (Note that if the formatting of the entire note happens to depend on the presence or absence of the title element, processing applications may need to operate in more complex ways, because they will already have encountered the note's start-tag by the time they determine whether the title was supplied.)

  • Always represent the title in the markup somehow, whether or not the default title is used.

    This modeling option may require some advice from the DTD implementor. For example, it might be appropriate to design a special kind of attribute for the title element (an attribute with a #CONREF default value) that, when filled in, represents the generated title and prevents element content from being supplied, and when left blank, allows element content to be supplied. While this is a tricky solution, it actually makes the document instances seem more “whole” and makes the processing expectations plain.
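
The processing expectation behind the second option (an optional title with a generated default) might be sketched as follows. This is a minimal illustration, not a real formatter: `note_title` and `DEFAULT_NOTE_TITLE` are hypothetical names, and the supplied title is assumed to arrive as a plain string or `None`.

```python
from typing import Optional

# Assumed house-style default; a real stylesheet would also localize it.
DEFAULT_NOTE_TITLE = "NOTE"


def note_title(supplied_title: Optional[str]) -> str:
    """Return the title to output for a note.

    A note with no title element (None) or an empty one still gets a
    title on output, so it must be treated as a titled element.
    """
    if supplied_title is None or not supplied_title.strip():
        return DEFAULT_NOTE_TITLE
    return supplied_title.strip()


generated = note_title(None)         # no title element in the instance
overridden = note_title("  Tip  ")   # author supplied a title
```

Every process that handles titled elements would need to call something like `note_title` rather than look only for a title element in the instance, which is the hidden complexity the option carries.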

Note

If you design a markup model that allows authors to override generated text, you are building some Tag Abuse potential into the model. For example, if authors can change the title of a note, they may be able to simulate information of a different kind, such as warnings, by manipulating the title text accordingly. You will need to balance the desired flexibility against the cost of ensuring markup precision.

The most difficult problem in dealing with generated text is planning how it will fit into its surroundings—such as the middle of a sentence. Your DTD implementor and the application developers might need to help you sort out the various processing expectations. By the time the DTD is implemented, it's essential that the expectations be precisely determined and documented.

To illustrate the importance of processing expectations, let's assume you have designed a model where an empty figure-cross-ref element links to a figure. The plan is that a stylesheet will generate some text that represents the figure, in place of the empty element. If the cross-reference is to the third figure in the first chapter, the generated text might be any one of the following (assuming a scheme where figures are numbered in a way that is subordinate to chapter numbers):

  1. 1–3

  2. Figure 1–3

  3. See Figure 1–3

  4. (see Figure 1–3)

  5. (See Figure 1–3.)

Each of these choices would require a different sentence structure to surround it. Some choices, such as the full parenthetical sentence, are more independent of their surroundings than others, which may be a useful quality. However, no matter which choice has been implemented in a stylesheet, if an author expects a different choice, the formatted result could be nonsensical. For example, if the stylesheet produces the effect of choice 5 but an author assumes that 1 is being used, sentences in the document might read:

For an illustration of the julienne technique, see Figure (See Figure 1–3.).
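
The mismatch can be reproduced with a small Python sketch. All names here (`XREF_FORMS`, `generate_xref`, `render`, and the `<figure-cross-ref>` placeholder string) are assumptions made for illustration; an actual stylesheet would implement only one of the five forms.

```python
EN_DASH = "\u2013"  # separator used in the figure numbers listed above
PLACEHOLDER = "<figure-cross-ref>"  # stands in for the empty element

# The five candidate forms, keyed by their number in the list above.
XREF_FORMS = {
    1: "{n}",
    2: "Figure {n}",
    3: "See Figure {n}",
    4: "(see Figure {n})",
    5: "(See Figure {n}.)",
}


def generate_xref(choice, chapter, figure):
    """Build the generated text for a figure cross-reference."""
    number = f"{chapter}{EN_DASH}{figure}"
    return XREF_FORMS[choice].format(n=number)


def render(author_text, choice, chapter, figure):
    """Replace the placeholder with the stylesheet's generated text."""
    return author_text.replace(PLACEHOLDER, generate_xref(choice, chapter, figure))


# The author wrote the sentence expecting choice 1 (the bare number).
author_text = (
    "For an illustration of the julienne technique, "
    "see Figure <figure-cross-ref>."
)
expected_by_author = render(author_text, 1, 1, 3)
produced_by_stylesheet = render(author_text, 5, 1, 3)
```

Here `expected_by_author` reads correctly, while `produced_by_stylesheet` is the nonsensical sentence shown above. Nothing in the markup is invalid in either case, so only documented expectations can prevent the error.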

Similar problems might arise if glossary terms were stored in a database in the singular form (for example, “file name”) and referred to with links in the document that stylesheets will replace with the term names. All references would need to be worded to support singular rather than plural constructions.

These problems are due to the variety inherent in written and spoken natural language. One way to overcome them is to expect that generated text will always be wholly independent of its surroundings, and to make this expectation clear to all authors. Another is to build presentational markup into the model that allows authors to have control over the form of the generated text. For example, the glossary term database mentioned above could store variant forms of each term:

Lowercase    First initial capital    All initial capitals    Plural lowercase    etc.
file name    File name                File Name               file names
system       System                   System                  systems

Then, each cross-reference to a glossary term could supply an attribute value indicating which form would be appropriate to generate in the current context.
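
A lookup of this kind could be sketched in Python as follows. The form names (`lower`, `first-cap`, `all-caps`, `plural-lower`), the term identifiers, and the function `glossary_text` are hypothetical; a real model would define its own attribute values.

```python
# Variant forms per term, mirroring the table above.
# The keys are assumed attribute values, not taken from any real DTD.
GLOSSARY_FORMS = {
    "file-name": {
        "lower": "file name",
        "first-cap": "File name",
        "all-caps": "File Name",
        "plural-lower": "file names",
    },
    "system": {
        "lower": "system",
        "first-cap": "System",
        "all-caps": "System",
        "plural-lower": "systems",
    },
}


def glossary_text(term_id, form="lower"):
    """Return the text to generate for a glossary cross-reference."""
    try:
        return GLOSSARY_FORMS[term_id][form]
    except KeyError:
        raise ValueError(f"no form {form!r} stored for term {term_id!r}")


sentence = "All " + glossary_text("file-name", "plural-lower") + " must be unique."
```

Raising an error for an unknown form, rather than silently falling back to the lowercase form, is a design choice that surfaces markup mistakes during processing instead of in the printed output.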

Note

If your documents will be translated, the difficulties of fitting generated text into its surroundings will become even more complex. If you do add any presentational attributes, it's likely that they won't have precise analogs in any of the other languages. Word ordering may be an additional problem.

6.8. Augmented Text

Large blocks of generated text are often called “augmented text” because they add to the content of a presentation instance in a fundamental way, and may even be “fed back” into the original SGML instance. Tables of contents are an example of augmented text. Following are suggestions for choosing how to model augmented text. The DTD implementor and application developers should have a hand in the discussion if possible.

  • No Markup

    In many cases, it isn't necessary to represent augmented text in the markup model; processing applications can just build the appropriate output along the way. This is usually the ideal choice because different presentation instances may have widely differing requirements on how the augmented text will appear and where it will be output.

  • Metainformation Placeholder

    You could allow a placeholder element (usually an empty element) for the construct to appear somewhere in the metainformation container, regardless of where it will appear in the final processed document. For example, you might have an optional empty “TOC” element somewhere in the metainformation element. This solution allows authors to control the presence or absence of the generated construct in the types of presentation instances that have tables of contents.

  • Location-Specifying Placeholder

    You could allow a placeholder element to appear in the linear order wherever it should appear in the output. For example, you might allow the “TOC” element to appear optionally before the preface, after the preface, and after the appendices. This solution allows authors to control the position of the table of contents in some types of presentation instances.

  • Full Markup Model

    You could design an element for the construct that has a real content model, and allow it either in the metainformation or in the linear order (as described above). For example, you might have a “TOC” element in the metainformation that contains one or more “TOC entries,” each of which contains a title and either a page number or a link to the relevant division.

    This solution allows authors to provide some default information or to correct generated information that has been fed back into the document instance after processing. This solution is also useful if the final delivered form will be the SGML documents themselves, in which case it may be useful to recipients to take delivery with this metainformation already generated. However, keep in mind that the accuracy of the generated information is guaranteed only just after it has been generated.

    Glossaries are similar to metainformation in that they help readers with their understanding of the content, but they usually also provide subject matter content, and their placement in the output might be controllable by an author. Some companies generate glossaries from databases of defined words based on the presence of those words in the content, which makes the glossaries act somewhat like tables of contents. However, sometimes authors must still “craft” a definition or provide a definition that isn't in the database, so a structure must usually be provided for glossary elements.
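
The glossary-generation approach described above might look roughly like this in Python. It is a sketch under simplifying assumptions: the content is plain text rather than an SGML instance, matching is case-insensitive on whole words only (so plural forms are not found), and `build_glossary` is an invented name.

```python
import re
from typing import Dict, List, Optional, Tuple


def build_glossary(
    content: str,
    database: Dict[str, str],
    authored: Optional[Dict[str, str]] = None,
) -> List[Tuple[str, str]]:
    """Collect glossary entries for one document.

    Database terms are included only if they occur in the content.
    Author-crafted entries are always included and take precedence
    over the database definition of the same term.
    """
    entries: Dict[str, str] = {}
    for term, definition in database.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, content, flags=re.IGNORECASE):
            entries[term] = definition
    entries.update(authored or {})
    return sorted(entries.items(), key=lambda item: item[0].lower())


database = {
    "file name": "The name by which a file is known.",
    "system": "A computer and its software.",
    "entity": "A named unit of storage.",
}
content = "Save the file name before you restart the system."
authored = {"system": "The host machine.", "julienne": "A thin strip cut."}
glossary = build_glossary(content, database, authored)
```

The `authored` parameter is where a markup structure for glossary elements earns its place: without it, authors would have no way to supply or refine a definition.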

6.9. Graphics

Most document information containing “characters” can be modeled in SGML, whether or not the characters are arranged as sentences. The decision in these cases often involves choosing how “content-based” or “structural” to make each semantic component.

However, sometimes the information being modeled seems so presentational that an SGML representation is pointlessly inefficient. For example, a graphical bitmap needs to look exactly the way it was created, with every pixel controlled. You could model the color of each pixel using elements and attributes, but storing the bitmap in a non-SGML format, such as TIFF, makes more sense. In these cases, the markup model needs to provide a way to point to or include the non-SGML data, such as a graphic element with an attribute that references an entity. Often, presentational markup is needed so that authors can gain the appropriate control over the appearance of the graphic. The DTD implementor can help with the technical aspects of this modeling task, and the project plan should indicate the non-SGML formats needed.

Just as with all components, though, you shouldn't overlook opportunities to interpret the graphics as being content-based. For example, if you are modeling a set of books on card games, each graphic showing how to deal a hand can be interpreted as mere graphical data—or it can be interpreted as cards. If the contents of the graphics can be modeled as “spreads” showing “hands” and “stocks” of “cards,” with each card having a “suit” and a “rank,” the necessary graphics might ultimately be generated from such markup. Given that card-game books are your business, the cost of developing the software to generate the graphics might easily pay you back in:

  • A lower budget for original artwork

  • Graphics that are consistently laid out

  • The flexibility to change the designs on the backs of the cards to reflect the new logo of the publisher in subsequent editions

  • The ability to publish for print-disabled audiences with “graphic equivalents” explaining the spreads

  • Error-checking to catch mistakes where the same card is mentioned twice in the same spread

If your processing gets really sophisticated, it could even create a multimedia environment in which card deals and draws are shown in animated fashion.
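
As one illustration of the error-checking benefit listed above, the following Python sketch finds cards that appear more than once in a spread. The data layout is an assumption: a spread is a dictionary mapping each hand or stock name to a list of `(rank, suit)` pairs, and `duplicate_cards` is a hypothetical helper.

```python
from collections import Counter


def duplicate_cards(spread):
    """Return the cards that occur more than once in a spread.

    `spread` maps a hand or stock name to a list of (rank, suit) tuples.
    """
    counts = Counter(card for cards in spread.values() for card in cards)
    return sorted(card for card, count in counts.items() if count > 1)


spread = {
    "north-hand": [("A", "spades"), ("10", "hearts")],
    "south-hand": [("K", "clubs"), ("2", "diamonds")],
    "stock": [("A", "spades"), ("7", "clubs")],
}
duplicates = duplicate_cards(spread)
```

A check like this is possible only because the cards are represented as content; a bitmap of the same spread would give software nothing to inspect.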