Chapter 11. Validation and Testing

Table of Contents

11.1. Setting Up and Managing a Bug-Reporting System
11.2. Validating the DTD
11.3. Validating the Markup Model
11.3.1. Wrong or Overly Constrained Model
11.3.2. Overly Broad Model
11.4. Testing the Use of the DTD in the Real World
11.4.1. Usability with Applications
11.4.2. Usability with People

You need to perform several kinds of testing to determine whether your DTD design goals were met in the actual implementation. As a result of your testing, you may need to improve on the markup model of the reference DTD, or you may need to adapt the model of one or more variant DTDs to a particular processing application or environment. For example, usability testing might result in simplification of the authoring DTD so that authors can better choose and apply markup.

This chapter describes the steps in testing and reviewing the DTD and what to look for. Testing and review involve the following steps:

  1. Reviewing the document analysis report

    The user group and other reviewers should make sure to put their comments in writing and to back up their suggestions with examples. It's best if they actually fill out a bug report (described in Section 11.1, “Setting Up and Managing a Bug-Reporting System”) to record problems.

  2. Validating and reviewing the DTD “code”

    The DTD implementor is responsible for ensuring that the DTD is valid at each stage of testing and revision. Section 11.2, “Validating the DTD” discusses how to do this.

    Technical experts should have the opportunity to read the draft DTD, with the document analysis report and the DTD maintenance documentation (discussed in Section 12.2, “Documentation for Readers of the DTD”) in hand. Again, written comments in the form of bug reports are best. Often, the comments gathered at this stage reflect the experts' desire to have a “pure” DTD; if there are reasons the DTD has compromised on some desirable features (such as mixed content models in the form that the ISO 8879 standard approves of), the implementor should have recorded them in the maintenance documentation.

  3. Testing the markup model provided by the DTD

    It's essential that you test the DTD by marking up sample documents. For this task, you need a conversion application or people willing to mark up documents manually in an editor (which can be SGML-aware or not, depending on the expertise and training resources available). Problems should be recorded in bug reports.

    Section 11.3, “Validating the Markup Model” describes factors to be aware of during the testing.

    Anyone marking up documents to test the DTD will need documentation. Usually the DTD user documentation is far from being finalized at this point, but at a minimum the draft of the reference manual, a list of all the available elements with their definitions, and a set of tree diagrams should be provided. (Chapter 12, Documentation discusses the components of user documentation.)

  4. Testing the SGML application (DTD and SGML declaration) with processing applications

    When you use the DTD and its related material with authoring, processing, and information management applications and introduce it to authors, you may find problems with the ideal model specified in the reference DTD. Alternatively, you may find problems that are due solely to the interaction between the DTD or SGML declaration and the tools. As a result, if you haven't yet branched out into variant DTDs to adapt the model to targeted purposes, you might do so at this stage. Again, the bug-reporting system should be used to record problems and suggestions.

    Section 11.4, “Testing the Use of the DTD in the Real World” discusses what to look for in testing the DTD with applications.

  5. Testing updates to the DTD

    For this job, the reviewers and testers will need not only the document analysis report and DTD maintenance documentation, but also a summary of the bugs attended to and the action taken, so they can test specifically the changes that have just been made.

11.1. Setting Up and Managing a Bug-Reporting System

Once the DTD has been finalized, the document type design team usually disbands, leaving behind the document analysis report as documentation for the maintenance work to come. This first generation of documentation is often adequate, because the team has had the time to do it correctly and has felt the need for that documentation. But two or three generations of DTD later, when the DTD has been altered and updated several times in a hurry or under pressure, it's easy to lose track of changes: what the changes were, why, when, and how they were done, who asked for them, and who did them. This problem is particularly critical when the history of changes to the reference DTD has been lost and its variants must be updated as well.

The best way to avoid such a problem is to implement a change control process compliant with the ISO 9000 quality assurance standards. Such a process supports the quality assurance certification of your department, and it helps tremendously in keeping track of the history of your DTDs.

The process and its documentation are tightly interwoven; the process cannot be completed without being documented. The change control process consists of:

  1. Defining a change request form for collecting bug reports and enhancement requests.

  2. Writing a procedure on how to fill out the form, whom to send it to, and how it should be processed.

  3. Having someone receive all the forms, and number and process each form before giving feedback to each claimant.

  4. Producing, updating, and circulating a compiled bug and enhancement list.

  5. Having a change control board gather regularly to decide what must and what needn't be done. The action column of the bug and enhancement list is then completed and a rationale of the decisions documented.

  6. Incrementing the revision number of the updated bug and enhancement list (corresponding to the DTD and documentation generations), sending copies to everybody involved, and archiving all the documents listed above.

With such processes and documents there is no chance of losing track of what happened during the history of the DTD. Even if the people involved have changed, continuity is assured.

Figure 11.1, “Change Request Form” shows a blank sample of a change request form.

Figure 11.1. Change Request Form


The bug and enhancement list is a summary of the requested actions and of the actions taken. It must indicate how to refer to the corresponding forms for more information. It is useful to formalize it as a table, where each bug is described in one row. The list as a whole must be identified with its list revision number and the date of issue. Each row of the table should show a selection of extracted information about a problem, as shown in Figure 11.2, “Bug List for SGML Project”.

It is useful to indicate visually which problems have been fixed, for example, by shading those rows or putting a mark in front of them. This saves time in DTD change control meetings, because the attendees need to review only the unmarked rows, either to find out why a decision has not been implemented yet, or to make a decision about what should be done.
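
If the bug list itself is kept in SGML, its rows could be modeled along the following lines. This is only a sketch: every element and attribute name below is hypothetical, and the fields are limited to those mentioned in the text (list revision number, date of issue, form number, action, rationale, and whether the problem has been fixed).

```sgml
<!-- Hypothetical markup model for a bug and enhancement list. -->
<!ELEMENT buglist - - (bug+)>
<!ATTLIST buglist
        revision        NUMBER          #REQUIRED
        issued          CDATA           #REQUIRED
>
<!-- One bug element per table row; formno points back to the numbered form. -->
<!ELEMENT bug     - - (summary, action?, reason?)>
<!ATTLIST bug
        formno          NUMBER          #REQUIRED
        status          (open|fixed)    open
>
<!ELEMENT (summary|action|reason) - - (#PCDATA)>
```

With a model like this, a stylesheet or query could shade the rows whose status is fixed rather than relying on manual marking.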

Figure 11.2. Bug List for SGML Project


11.2. Validating the DTD

The most superficial kind of DTD testing you must perform is the validation of the DTD markup declarations themselves. “Validation” means different things for DTDs and document instances. Validating a DTD ensures that its markup rules are well formed, making it possible for instances to follow them, whereas validating a document instance involves checking to see that the rules were followed in that particular document. Before you can create realistic instances, you must validate the DTD.

In general, the implementor can do this validation alone and can resolve problems without input from the design team. A few problems may need design team communication, as described below.

To perform the main part of this validation, you need a validating parser. There are several public-domain parsers available, and most commercial SGML software products contain a validating parser as well. You may also need a rudimentary document instance containing a minimal amount of content and markup, because the parser may not expect to receive just a DTD as input. If you have implemented your DTD using computer-aided DTD development software, it is probably syntactically valid already because the software ensured that only valid DTD constructs were created.
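
A rudimentary instance for this purpose can be as small as the following sketch. It assumes, hypothetically, that the DTD is stored in a file named joke.dtd, that joke is its top-level element, and that para accepts character data; substitute your own names.

```sgml
<!DOCTYPE joke SYSTEM "joke.dtd">
<joke>
<para>Placeholder text for parser testing.</para>
</joke>
```

Because the parser must read the whole DTD before it reaches the instance, errors in the declarations are reported even though the instance exercises almost none of them.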

Note

Subtle differences exist in the error-reporting behavior of various commercial and public-domain parsers, and at the time of writing, no single test suite exists yet for assessing the conformance of validating parsers to the ISO 8879 standard. It is a good idea to validate any DTD with multiple parsers to ensure that it contains no errors and that it will be usable across SGML systems.

The following checklist identifies errors commonly made in typing and constructing DTD code that will be found by validating parsers.

  • Unbalanced or Mismatched Pairs

    Check for unbalanced or mismatched pairs of angle bracket ( < > ) markup declaration delimiters, parenthesis [ ( ) ] group delimiters, single ( ' ) or double ( " ) quotation mark literal delimiters, or double-hyphen ( - - ) comment delimiters.

    <!-- The following is used in several lists--but you
    can change it if you need to.->
    <!ENTITY % listcontent "item>
    ⋮
    <!ELEMENT  list    - - ((%listcontent;)+ >
    <!ATTLIST  list
            type            NAME            #IMPLIED
            security        (open|confid)   open
    

    The comment starting on the first line has two problems: The final delimiter has only one hyphen, and the comment text contains a “dash” made of two hyphens, which will be interpreted as a comment delimiter. The entity declaration is missing a closing quotation mark. The element declaration is missing a closing parenthesis. The attribute definition list declaration is missing a closing angle bracket.

  • Missing Exclamation Point

    An exclamation point ( ! ) from the opening markup declaration delimiter may be missing.

    <!ELEMENT model-number - - (#PCDATA)>
    <ATTLIST  model-number
            id              ID              #IMPLIED
    >
    
  • No Omitted-Tag Specifications

    The omitted-tag specifications may be missing from an element declaration when the OMITTAG feature has been set to YES, or they may be present, incorrectly, on an attribute definition list declaration.

    <!ELEMENT address     (street, city, country, postcode?)>
    <!ATTLIST address - -
            district        NUMBER          #REQUIRED
    >
    
  • Zero Instead of “O”

    The use of a zero (0) instead of a lowercase or uppercase letter O in an omitted-tag specification will cause an error.

    <!ELEMENT joke - 0 (para+)>
    

    Some people use lowercase o's exclusively because they are easier to distinguish from zeros.

  • Parameter and General Entity Confusion

    You may discover a missing percent sign ( % ) or the absence of a space after the percent sign in parameter entity declarations, and conversely, the presence of the percent sign in general entity declarations.

    <!ENTITY %listcontent  "item">
    
    <!ENTITY commonattribs "id  ID  #REQUIRED">
    
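    The following declaration is also wrong if fruitbat-article contains a fragment of a document instance, because such a fragment must be declared as a general entity: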
    
    <!ENTITY % fruitbat-article SYSTEM "fruitbat.sgml">
    
  • Multiple Declarations

    Only one attribute definition list declaration is allowed for each element.

    <!ELEMENT glossentry (term, def)>
    <!ATTLIST glossentry
            id              ID              #IMPLIED
    >
    ⋮
    <!ATTLIST (dictentry|glossentry)
            id              ID              #IMPLIED
    >
    

    Some SGML-aware products allow multiple declarations for attribute definition lists by concatenating them into one “master” list, but this is nonstandard behavior.

  • Problems with Content Model Parentheses

    Beware of parenthesis content model delimiters used with declared-content keywords and the absence of the delimiters when #PCDATA is specified.

    <!ELEMENT trademark - - #PCDATA>
    <!ELEMENT indexterm - - (RCDATA)>
    
  • Forward References to Entities

    Entities must be declared before they are referenced.

    <!ELEMENT docinfo  - - (%metainfo.mix;)+>
    ⋮
    <!ENTITY % metainfo.mix  "title|ISBN|partnumber|author">
    

    This error can indicate problems with the organization of the DTD's modules. Modularizing and parameterizing a DTD are discussed in Chapter 10, Techniques for DTD Reuse and Customization.

  • Ambiguous Content Model

    Parsers will report an error for ambiguous content models.

    <!ELEMENT division - - (title, para?, para+, subdiv*)>
    

    In this case, the same logical model could have been achieved with the following declaration.

    <!ELEMENT division - - (title, para+, subdiv*)>
    

    However, ambiguity errors can indicate problems with the markup model that need the attention of the document type design team. Handling content model ambiguity is discussed in Section 8.2.1, “Handling Specifications That Specify Ambiguous Content Models”.
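
For comparison, the flawed declarations in the checklist could be corrected as follows. This is a sketch that guesses at the intent of each fragment: the choice of - - for address and of - o for joke is an assumption, and declared content RCDATA is kept for indexterm only because the original used that keyword.

```sgml
<!-- Balanced delimiters, and no double hyphen inside comment text. -->
<!ENTITY % listcontent "item">
<!ELEMENT  list    - - ((%listcontent;)+)>
<!ATTLIST  list
        type            NAME            #IMPLIED
        security        (open|confid)   open
>

<!-- Exclamation point restored on the attribute list declaration. -->
<!ELEMENT model-number - - (#PCDATA)>
<!ATTLIST model-number
        id              ID              #IMPLIED
>

<!-- Omitted-tag specification on the element, not on the attribute list. -->
<!ELEMENT address - - (street, city, country, postcode?)>
<!ATTLIST address
        district        NUMBER          #REQUIRED
>

<!-- Letter o, not the digit zero. -->
<!ELEMENT joke - o (para+)>

<!-- Percent sign followed by a space marks a parameter entity. -->
<!ENTITY % commonattribs "id  ID  #REQUIRED">

<!-- No percent sign for an entity that holds instance text. -->
<!ENTITY fruitbat-article SYSTEM "fruitbat.sgml">

<!-- Parentheses around #PCDATA, none around a declared-content keyword. -->
<!ELEMENT trademark - - (#PCDATA)>
<!ELEMENT indexterm - - RCDATA>

<!-- Entity declared before its first reference. -->
<!ENTITY % metainfo.mix  "title|ISBN|partnumber|author">
<!ELEMENT docinfo  - - (%metainfo.mix;)+>
```

The duplicate attribute definition list for glossentry has no corrected form here; the fix is simply to delete one of the two declarations.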

If your validating parser indicates that an error appears on a certain line of your DTD, work backwards from that point to see what might be wrong. Sometimes the problem can be far removed from the apparent error.

Many parsers return warnings for DTD constructs that are deprecated, if not actually prohibited. For example, you might see warnings for the following:

Not all DTD problems are purely “syntactic” in origin. For example, it is perfectly legal to specify the following element declaration, which creates a content model that is impossible to satisfy in an instance.

<!ELEMENT division  - - (title, para*, division+)>

Because every division must contain a lower-level instance of itself, no document can ever contain a valid division element. The only way to discover this situation is to test the DTD with a document instance that actually contains a division element, which will result in an error. Therefore, an important part of validating a DTD, especially if you are inexperienced at implementing DTDs, is validating one or more simple documents that contain all of the configurations of markup that the DTD allows.
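
One possible repair, assuming the design intent is that nested divisions are optional rather than required, is to replace the plus sign with an asterisk. The small test document below is a sketch; the declarations for title and para are assumptions added only to make it self-contained.

```sgml
<!DOCTYPE division [
<!-- Nested divisions are now optional, so the recursion can end. -->
<!ELEMENT division     - - (title, para*, division*)>
<!ELEMENT (title|para) - - (#PCDATA)>
]>
<division><title>Outer division</title>
<para>Some text.</para>
<division><title>Inner division</title></division>
</division>
```

Running a document like this through the parser exercises both the nested and the non-nested configuration of division.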

11.3. Validating the Markup Model

To test the validity of the markup model represented by a DTD, start by marking up the sample documents that were used in the analysis and design work. This provides an overall sanity check that nothing was forgotten. In addition, mark up both typical and unusual sample documents that weren't used in the analysis, as a second-level check. As the use of the DTD is rolled out in your organization, problems will likely continue to arise.

Two basic types of problems can be found. The markup model might not accommodate a document that falls within the scope (that is, it is wrong or overly constrained), or it might accommodate documents that fall outside the scope (that is, it is overly broad). The consequences are different for each.

There are two main ways you can use marked-up documents to validate the markup model:

  • Conduct a “code review” of marked-up documents to see if they meet your expectations and to determine which markup never gets used (useful with both automatically and manually marked-up documents)

  • Conduct a “contextual inquiry” with people who are manually marking up documents

    This is a technique that involves observing the authors and asking questions as they work. It can be a highly effective way to determine where markup is missing. (It is also useful for usability testing of the DTD, for example, where there are many poorly distinguished choices. Usability testing is discussed in Section 11.4.2, “Usability with People”.)

Later testing with real-world applications often highlights additional subtle problems in the markup model. For example, say you haven't allowed for marking up model numbers, but you soon find that without being able to locate model numbers easily, you can't generate cross-reference lists of machine parts that contain other parts. A thorough design process will have helped avoid this. However, if you find you must make additions to the markup model at this late date, it's possible to do. (Of course, you may need to put your SGML documents through an additional conversion process in order to take advantage of the added markup.)

11.3.1. Wrong or Overly Constrained Model

If your markup model is wrong or restricts markup options too severely, you may not have enough markup in your document files to support the kinds of utilization you want. Further, if authors commit Tag Abuse just to make the data fit or to get the kind of formatted output they expect, the documents may use markup inconsistently and idiosyncratically, which can damage your ability to process them in any useful way.

If testing reveals that the markup model doesn't account for all the documents' needs, the model may need to be broadened or changed, or existing documents may need to be rewritten or restructured.

For example, if the model allows for chapters to be grouped into parts, but a document is then found that also groups its appendices into parts, the model may need to change to accommodate such documents. On the other hand, you may discover that the document with the grouped appendices was written poorly or violates a corporate style guideline, in which case it needs to be determined whether legacy documents with this problem can be rewritten to conform to the new, more “correct” model.
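
As an illustration of broadening the model, suppose the original declarations were book with content (title, (chapter+ | part+), appendix*) and part with content (title, chapter+). These element names and models are hypothetical, not taken from any real DTD. A broadened version that lets parts group appendices as well might look like this:

```sgml
<!-- A book holds either loose chapters and appendices, or a series of parts. -->
<!ELEMENT book - - (title, ((chapter+, appendix*) | part+))>
<!-- A part may now group either chapters or appendices. -->
<!ELEMENT part - - (title, (chapter+ | appendix+))>
```

Note that this sketch also permits parts of appendices to precede parts of chapters, which may itself be broader than the design team wants.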

11.3.2. Overly Broad Model

The detection of an overly broad model is harder than that of an overly constrained model, since legacy documents don't need to be “forced into the mold,” and the drawbacks of leaving it too broad are harder to quantify. Following are some possible scenarios.

If certain portions of your markup model are overly broad, validating parsers may not catch documents that violate style guidelines or structural requirements. For example, if every product description should have a part number supplied, but there is no requirement for the part number element to appear, documents may make it all the way to the delivery stage without a part number. Often, markup model standards such as this are relaxed in order to ease authoring and the conversion of legacy documents. If possible, it's better to make variant authoring or conversion DTDs for these stages of document processing, so that the proper validation can be done in the later stages.
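
One way to keep a strict and a relaxed model in a single source is to switch between them with marked sections. The sketch below uses that technique as an illustration; the element name proddesc and the entity names authoring and delivery are hypothetical.

```sgml
<!-- Set exactly one of these two entities to INCLUDE. -->
<!ENTITY % authoring "IGNORE">
<!ENTITY % delivery  "INCLUDE">

<![ %authoring; [
<!-- Relaxed model for authoring and legacy conversion. -->
<!ELEMENT proddesc - - (title, partnumber?, para+)>
]]>

<![ %delivery; [
<!-- Strict model: every product description carries a part number. -->
<!ELEMENT proddesc - - (title, partnumber, para+)>
]]>
```

Documents that pass validation against the delivery setting are guaranteed to contain a part number element, although the parser still cannot tell whether its content is meaningful.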

Note

Some structural requirements cannot be checked by validating parsers. For example, if your paragraph element allows #PCDATA, it's possible to have paragraph start- and end-tags with nothing between them. If it's important to perform this kind of validation, applications outside of the validating parser must perform it.

If the model provides multiple ways to mark up the same data, authors may be confused and are likely to mark up documents inconsistently, possibly harming your ability to process the documents in the desired ways. This situation tends to occur in DTDs with a broad audience. For example, if the model offers two different locations where the document title can be stored, some authors (and applications) will use one and some will use the other. If possible, eliminate all cases of duplicated markup functions. If this can't be done, try to construct the model so that if one is present, the other can't be supplied. At the least, clearly document the precedence of usage and the processing expectations if both are used.
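
A content model can make two title locations mutually exclusive by placing them in an or group. In this minimal sketch the element names report, front, body, and author are hypothetical:

```sgml
<!-- The title appears either directly or inside front, never in both places. -->
<!ELEMENT report - - ((title | front), body)>
<!ELEMENT front  - - (title, author*)>
```

Authors can still choose between the two locations, so the documentation should say which one is preferred, but a validating parser will now reject a report that supplies both.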

If the model offers many more elements than will be used on a regular basis, your authors may have an unrealistically steep learning curve, which can discourage the use of proper markup. You may find that the model has a variation of the problem of multiple markup methods—forms of markup that are so similar, or have such a fuzzy boundary between them, that there's no consistent way to use each one appropriately. It's often the case that DTDs offer many more data-level elements than can be used distinctively; for example, software documentation DTDs might offer several elements for command-level computer instructions, command argument keywords, environment variables, and so on.

If it is possible to make a case for distinguishing the elements clearly, the documentation and training must provide the means for authors to test their subject matter knowledge against the model and choose appropriately. If it turns out not to be possible, the overall number of choices may need to be reduced, which should be done with the design team's help. Using software to collect statistical data on markup usage, as mentioned in Section 13.6, “Phase 4: Quality Inspection of Documents”, could be useful in identifying markup that never gets used.

11.4. Testing the Use of the DTD in the Real World

Once you're satisfied that the DTD is powerful enough to represent the documents in its scope without being too broad, the markup model can be said to be valid. However, this is not the same as saying that the whole SGML application, including both the DTD and the SGML declaration, is usable by processing applications or people. You need to test them with the intended users and applications to discover any inefficiencies or processing-related problems in the SGML application itself or in its interaction with your chosen tools, which might occasion the need to create variants for different purposes.

11.4.1. Usability with Applications

To test the usability of your SGML application with processing applications, you need to do two things:

  • Make sure the application developers review the document analysis report and draft DTDs to catch potential problems early.

  • Test the DTD with the applications.

Following are some of the problems you might encounter:

  • The DTD may depend on optional SGML features, such as SUBDOC, that aren't supported by your tools. This is a problem with the tool rather than the DTD, but the latter may need to change to accommodate the former.

  • Your tools may have resource problems related to the complexity of the documents; for example, an average document may need to nest elements to a depth of 100 levels (a fact that will need to be represented by the TAGLVL quantity in the SGML declaration, described in Section A.9.4, “Concrete Syntax”), but your tools can't handle this depth of nesting. Again, you may need to create a variant DTD that can be handled by your tools, if the tools' processing ability can't be changed.

  • Your tools may have difficulty querying on your document database to find relevant information because there is too little markup. This might be a problem in the DTD (or it might be a problem relating to authoring).

  • Your stylesheets don't have the ability to do all the kinds of formatting you want, without additional presentation-specific markup in the DTD to help them. This situation might call for a presentation DTD so that the reference DTD doesn't contain this markup.
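
The SGML declaration settings behind the first two problems look roughly like the fragment below. It is a partial sketch, not a complete declaration: the value 100 comes from the nesting example above, the reference quantity set gives TAGLVL a value of 24, and the other FEATURES values are illustrative assumptions.

```sgml
QUANTITY SGMLREF
         TAGLVL   100  -- raised from the reference value of 24 --
⋮
FEATURES
  MINIMIZE DATATAG NO  OMITTAG  YES  RANK     NO  SHORTTAG YES
  LINK     SIMPLE  NO  IMPLICIT NO   EXPLICIT NO
  OTHER    CONCUR  NO  SUBDOC   NO   FORMAL   NO
```

Comparing these values against each tool's documented limits and supported features is a quick first test before any documents are processed.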

11.4.2. Usability with People

Assuming that the testing discussed in Section 11.3.2, “Overly Broad Model” has found the markup model not to be too broad, it may still be the case that the DTD is difficult to use or that it makes the process of choosing markup too difficult. Either of these conditions will probably result in Tag Abuse by authors, and therefore, documents that are not marked up fully, accurately, or consistently. These results may be due to the DTD itself, to the interaction between the DTD and a particular authoring tool or environment, or to problems with the amount or quality of the DTD documentation (discussed in Chapter 12, Documentation) and training (discussed in Chapter 13, Training and Support).

The testing methods discussed in Section 11.3, “Validating the Markup Model”, code review and contextual inquiry, work equally well here.

Following are some of the problems you might encounter. In each case, you might respond either with changes to the authoring DTD or with changes or customizations to the authoring tools and environment, depending on your resources and preferences.

  • The DTD may be very large, with too many choices of elements in various contexts for authors to handle well.

    Even in an SGML-aware authoring tool, the number of element choices in some contexts might be overwhelming. For example, a typical context that contains information units might allow thirty elements, with eight of them being different types of list. One solution might be to subset the DTD to just the portions that authors actually need. If subsetting doesn't help, ensuring that elements in the same class are grouped together when presented to an author can be useful. You might accomplish this in one of several ways:

    • Actually changing the generic identifiers in the authoring DTD so that elements will be presented alphabetically in the appropriate classes (list-bulleted and list-numbered instead of bulleted-list and numbered-list)

    • “Aliasing” the elements in the authoring tool

    • Customizing the tool to produce a special dialog box that presents often-used elements first or presents the elements by class

    • Creating wrapper elements in the authoring DTD to present fewer choices to the author until a class is selected (adding a list element that contains one of the eight kinds of list)

    • Mapping special key presses or commands to actions that insert templates containing commonly used elements

  • The tags may be too long, or the element names too hard to type or read.

    Obviously, one solution is to change the element names. If there are reasons to keep the original names, however, there are other solutions. It may be possible to “alias” the names to better ones in the authoring environment, or minimization techniques can be used to reduce the amount of typing.

  • Some content models, which may be suitable for finished documents, may be too restrictive for use during the authoring process.

    If authors are continually frustrated by validation errors in their partial or draft documents, and the errors add nothing to their ability to produce high-quality documents, then the authoring DTD may need relaxed versions of the relevant content models.
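
Two of the DTD-side responses above, wrapper elements and relaxed content models, can be sketched as follows. The names list.types, procedure, and step are hypothetical, only two of the list types are shown, and the relaxed model assumes a finished-document model of (title, step+).

```sgml
<!-- Wrapper element: authors pick list first, then one of its types. -->
<!ENTITY % list.types "list-bulleted | list-numbered">
<!ELEMENT list - - (%list.types;)>

<!-- Relaxed authoring model: drafts may lack a title or steps for a while. -->
<!ELEMENT procedure - - (title?, step*)>
```

Documents written against the relaxed model should still be validated against the stricter model before delivery.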