Chapter 4. Document Type Needs Analysis

Table of Contents

4.1. Preparing for the Design Work
4.1.1. Learning Basic SGML Concepts
4.1.2. Learning to Recognize Semantic Components
4.1.3. Learning the Tree Diagram Notation
4.1.4. Scoping the Work
4.1.5. Planning to Prepare Deliverables
4.1.6. Learning About Teamwork Norms
4.1.7. Gathering Analysis Input
4.2. Performing the Needs Analysis
4.2.1. Step 1: Identifying Potential Components
4.2.2. Step 2: Classifying Components
4.2.3. Step 3: Validating the Needs Against Similar Analyses

The document type design team will go through four phases of work: preparation, needs analysis, modeling, and reporting. This chapter discusses the first two, and Chapter 5, Document Type Modeling and Specification, discusses the last two. For the purposes of explaining the phases and steps, we'll assume a project where a new document type is being designed. For information about the process of customizing an existing model and how it compares to the process for a new DTD, see Section 7.1, “Customizing an Existing DTD”.

Analysis involves identifying potential needs, classifying them into logical categories, and validating the needs against analyses and models that have already been prepared. Here, “needs” refers to descriptions of the basic intelligence that the documents must encode, rather than a specific SGML-based model for their required organization or hierarchical structure.

As a simple example of analysis work, say that the team members notice there are document constructs that are conventionally called “procedures”[7] and “steps” in the documents being examined. They first name and define these constructs clearly. Then, as they proceed, they group the constructs with other similar constructs that are discovered, such as “lists” and “items,” so as to be able to consider their status and their design requirements in tandem. Finally, they compare their understanding of these constructs against the work that other people have done with procedure constructs.

The output of the analysis work lists, describes, and categorizes the distinctions potentially needed in the document type. In the modeling activity, the team will decide which distinctions should be addressed in the project and will use SGML modeling techniques to devise a cohesive markup model that takes into account each of the accepted distinctions.

The natures of the analysis and modeling phases differ in two important ways:

Section 4.1, “Preparing for the Design Work” discusses how to prepare for the document type design work, and Section 4.2, “Performing the Needs Analysis” discusses how to perform the actual analysis.

4.1. Preparing for the Design Work

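The team's preparation has the following parts, each covered in its own subsection:

  • Learning basic SGML concepts

  • Learning to recognize semantic components

  • Learning the tree diagram notation

  • Scoping the work

  • Planning to prepare deliverables

  • Learning about teamwork norms

  • Gathering analysis input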

The group-oriented training and preparation can take from less than one day to as much as three days, depending on the experience the team members already have with structured markup, the clarity of the project goals, and the availability of analysis input. Much of the input gathering can usually be done outside the group setting.

4.1.1. Learning Basic SGML Concepts

The members of the design team need to receive some minimum level of SGML conceptual training covering its basic philosophy, the strengths and limits of SGML technology, and the rudiments of SGML markup.

The following technical topics are likeliest to be helpful at this stage.

  • DTDs as sets of rules

  • Elements, including the various content model configurations for occurrence, sequence, and exceptions

  • Attributes, including ID/IDREF linking mechanisms

  • Entities and the notion of non-SGML data notations (Figure A.3, “Functional Entity Types” shows the functional classes of entities in a way that may be helpful to nontechnical people)

  • Character data

The introductory material in Chapter 1, Introduction to SGML, illustrates a useful level of conceptual training, though not all the topics listed are discussed there. Advanced topics such as markup minimization won't contribute much to the kind of understanding that the team members need, and may even be confusing or overwhelming at this stage.

Following is some advice for avoiding common mistakes and making the best of the training opportunity.

  • Avoid Training on SGML Syntax and DTD Implementation

    Unfortunately, an emphasis on syntactic details can spoil the outlook of people who are technically minded. If everyone on the team already thinks of themselves as DTD-writing experts, some “unlearning” might even be in order, so that enthusiasm for clever content modeling can be kept to the implementation stage, rather than letting it intrude into the analysis and modeling stages.

  • Don't Study Other DTDs Yet

    We do not believe in having the team members study other DTDs before starting to define their own needs. Experience shows that they become so influenced by what they have read that they tend to think only in terms of what they already know, which negatively affects their contributions to the project at hand.

  • Consider Using Structured Editors in the Training

    It can be helpful to give the team members an awareness of the markup responsibility placed on authors by letting them play with an SGML-aware editing tool and a standard DTD as part of the training, even if there are no plans to use structured editing tools in the real environment.

4.1.2. Learning to Recognize Semantic Components

The team will spend a great deal of its time working with units of specification that we call semantic components. A semantic component corresponds to an expressed need for a distinction between one kind of data and all others. For any proposed component that has been accepted, the result will be SGML markup that reflects the difference in some way.

To illustrate, we can take an example from phonology, the science of speech sounds in natural languages. A “minimal pair” of words is used to elucidate which sounds are significant in a language. For example, in English, “pack” and “back” mean two different things, proving that the differences between the p and b sounds are significant, even though they are produced in a physiologically similar way (differing only in vocalization). On the other hand, “rack” with an American-sounding r and “rack” with a rolled French-sounding r mean the same thing in English. Thus, English has only the three “sound components” p, b, and r. Given appropriate sets of word pairs in another language, it might be determined that the four unique sounds (p, b, and the two kinds of r) represent a different set of “sound components.”

Likewise, in examining document information, if two pieces of information should be considered to be the same kind of information, they form examples of only one semantic component. But if they mean something different, or should be treated differently by applications or represented differently to readers, they form two components, a “minimal pair” of kinds of information.

The difference between sounds in natural languages and pieces of document information is that the document type design team has a choice about what to consider different or the same. The basis for making this choice is the motivation to do so, based on the kinds of desirable utilizations that become possible or are facilitated when a new distinction is made.

An everyday example shows how important such motivation can be in determining the richness of choices. In warm climates, people give little thought to frozen precipitation, and can usually get away with using the one word “snow” to talk about the subject. In colder climates, people who drive cars need to know more about snowy conditions, and listen avidly to radio reports that tell them how deep the snow is, whether it is slushy, windblown, icy, and so on, and whether sleet, freezing rain, or hail might also be approaching. People who ski may want to determine whether each of the layers of snow on their favorite mountain is machine-made, machine-groomed, powdery, granular, loose, packed, and so on, along with how deep it is. The point is that if you care enough about the information (or about what you plan to accomplish once the information is in your possession), you can make amazingly fine distinctions. Conversely, if you don't care about the information, the distinctions can get in your way—in other words, they can be too costly to be worth making.

The equivalent, in terms of document type needs analysis, would be to compare two kinds of information and see if they're really different. For example, you might weigh having several different kinds of procedures (say, installation versus troubleshooting procedures) against having only a single kind. It's not right or wrong to have one versus several dozen semantic components for a particular category of information; the choice simply depends on your focus for use of that information.

It may seem to you that a semantic component corresponds directly to an SGML element type, and once a team has moved on to the modeling phase, a one-to-one mapping is often the result. However, SGML markup distinctions can be made in a number of ways: with a new element, with an existing element made available in a new context, or with an attribute value, for example. The goal of analysis is only to identify the desired distinctions. The team will have the opportunity to choose their expressions as elements and attributes in the modeling phase, even if the choice seems intuitively obvious earlier in the process.

How do team members identify potential semantic components? They must extract them from sample documents and other sources. This process requires having a certain outlook that accords with the “generalized” philosophy of SGML. The following sections discuss concepts and techniques that can be useful in teaching team members to recognize potential components.

Recognizing Content, Structure, and Presentation

Components can reflect differing amounts of information about “meaning” versus formatting, and can be described by reference to the following categories:

  • Content-based components indicate what the information is (or represents) in an abstract sense, and avoid implying what its ultimate appearance will be. Content-based components most closely model things from “the real world” in SGML.

    Following are examples of relatively content-based components:

    Addresses, streets, and postal codes
    Machine part numbers, quantities, and prices
    Software command names and descriptions
    Recipes, ingredients, temperatures, and preparation times

  • Structural components make their distinctions by relying on basic characteristics of their structure and, usually, on the presentational traditions surrounding print-based typesetting and publishing. (Some people actually prefer the term “publishing” rather than “structural,” but here we are following common usage in the industry.) Structural components are often the workhorses of markup that authors use most when composing straight “document text,” but these components don't expand many horizons of document utilization.

    Following are examples of relatively structural components:

    Lists and list items

  • Presentational components describe precisely how the information should look, without implying any notion of meaning (in the real-world sense). Their entire definition can be said to consist of processing expectations: requirements and constraints on the formatting or other processing of the document content. Presentational components can harm a document's ability to be utilized in various creative ways and are usually antithetical to the SGML philosophy; their place is usually in the stylesheets or other applications that process SGML documents for output.

    Following are examples of relatively presentational components:

    Phrases that have a specified font or point size
    Regions to keep on a line or page
    Places to break a line or page
    Indented regions

People looking at information can interpret it in any of the three ways (or as some combination), which can result in the identification of different components. Because of this possibility for variation, the design team members need some guidelines on which to base their interpretations. First, they must realize that any information about a component that has not been made explicit in the component's definition is unpredictable, that is, not consistently available to users or applications. For each potential component, the team needs to ask “What characteristics of the information should be most predictable?”

Following are some examples of alternative interpretations of component collections.

| Content-Based Interpretation | Structural Interpretation | Presentational Interpretation |
|---|---|---|
| Machine part description entry, machine part number, quantity of part in stock, price of part | Table row, cell in row | Three-column text aligned horizontally, with each horizontal group separated by a 1-point rule, the second column starting two inches in and the third column starting four inches in, and with a dollar sign preceding text in third column |
| Copyright statement | Paragraph | Block of text in 8-point type |
| Software command name mentioned in text | No equivalent | Bold phrase |
| No equivalent | No equivalent | Characters that should be in small capitals to match a particular trademark owner's desired appearance |
| Introductory chapter | Division and division title with the content “Introduction” | The word “Introduction”, on its own line, in 24-point Helvetica type |

On what basis can a team decide among the interpretations?

The content-based components allow for prediction of what the information means, but the information's appearance may be variable. For example, command names could just as easily be rendered in Courier font as in boldface. For another example, machine part information could be presented as selective hits returned from a search rather than as static tables, or could be arranged in a nontabular way. If the project goals include multiple presentations for the data or flexibility in retrieval or processing, content-based components are best.

The structural components allow for prediction of the information's organization at a crude level, but little or nothing is known about the contents; for example, tables could contain baseball scores as well as machine part information. Also, the presentation can be somewhat variable; for example, for any one table, indent levels and horizontal and vertical ruling could be applied entirely differently. If the project goals require only a simple treatment of data, structural components can serve acceptably while protecting the data from containing unnecessarily presentational markup.

The presentational components allow for prediction of the information's appearance, but its meaning and even its structure may be obscure. For example, it's easy to see that a phrase is bold, but it's impossible to tell why it's bold, so that if you want to index command names automatically, you can't safely set up an application that finds and indexes all the bold phrases. Presentational components are best where authors absolutely need control over appearance in order to provide accurate document data (for example, providing verses of poetry in units of “lines”); otherwise they should be relegated entirely to stylesheets and other processing applications.
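The indexing problem described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the element names (command, emphasis, b), the sample sentences, and the helper function phrases are all hypothetical, and a regular expression stands in for a real SGML parser.

```python
import re

# Hypothetical markup fragments; the element names are illustrative only.
CONTENT_BASED = (
    "<para>Run <command>ls</command> to list files. "
    "<emphasis>Never</emphasis> run it as root.</para>"
)
PRESENTATIONAL = (
    "<para>Run <b>ls</b> to list files. "
    "<b>Never</b> run it as root.</para>"
)

def phrases(markup, element):
    """Return the text of every occurrence of a simple, non-nested element."""
    pattern = r"<{0}>(.*?)</{0}>".format(re.escape(element))
    return re.findall(pattern, markup)

# Content-based markup: the index contains command names and nothing else.
command_index = phrases(CONTENT_BASED, "command")

# Presentational markup: every bold phrase is picked up, whatever its meaning.
bold_index = phrases(PRESENTATIONAL, "b")
```

With the content-based fragment, the index holds only the command name. With the presentational fragment, the emphasized word lands in the index too, because nothing in the markup records why a phrase is bold.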

Because content-based components impose relatively greater amounts of overhead on the document creation process—as the examples above demonstrate, they're typically more numerous and complex than structural components—they are most appropriate for the information that falls within the document type's specialized domain. For example, cookbooks don't usually contain complex postal addresses. If some addresses appear in a cookbook, the design team can fall back on a structural component, say, a “general-purpose display,” to hold the foreign information.

In general, opportunities for content-based components shouldn't be passed over lightly because these components, when appropriately turned into markup models, can help you increase the processing potential of your document data. However, often the simplest structural components can bring benefits to documents that were once encoded with system-specific markup, and they may also be more readily accepted by users.

Before people start identifying real components, they need some practice. It can be helpful to examine documents from an information domain that is far removed from the one that will be under discussion, so that attitudes towards decision-making aren't frozen into place early in the project. We've had success using restaurant take-out menus and newspaper articles in training sessions. For example, examine the menu fragment shown in Figure 4.1, “Restaurant Menu for Component Exercise”.

Figure 4.1. Restaurant Menu for Component Exercise


Some people see the menu as having a general-purpose structure, with a repeatable structural component called “menu section” and another component called “menu section title” labeling the contents: soups, steamed dishes, and so on. Other people see a content-based component for “soup” information, another for “steamed dish” information, and so on. Depending on the goals of the project, either interpretation could be valid.[8]

People who use WYSIWYG systems commonly interpret the menu as a document formatted in one column, with text appearing in various fonts and point sizes and with a graphic symbol next to certain lines. However, these aspects of the menu are all transient. The layout and fonts could be changed without making the document content unrecognizable as a menu. Therefore, such items can often be rejected out of hand as true semantic components. If there is doubt about any one of them, it can be recorded as a potential semantic component and then examined more closely. Likewise, the numbers next to the dishes and the special symbols that indicate spicy dishes are artifacts of the formatting characteristics of this printed version of the menu. Different numbering systems such as letters (or no numbering system at all) could have been chosen instead. Spicy dishes could have been indicated with red type, stars, or small chili-pepper graphics.

The goal of the exercise is to draw abstractions at least to the point of recognizing the notions of “menu entry,” “dish name,” “dish price,” “dish spiciness,” and so on, rather than those of “number,” “line,” and “graphic symbol.” (Note that if the menu being examined had used a rating system for indicating exactly how spicy a dish is, a “level of spiciness” component would be advisable, rather than just a component for “whether the dish is spicy.” This more precise component might need to be considered even in the current case, if the planned utilizations suggested it.)
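As a rough illustration of why “dish spiciness” is a more useful abstraction than “graphic symbol,” the following Python sketch records spiciness once and leaves it to each rendering to decide how to show it. The class, field, and function names and the sample dish are invented for this illustration; they are not taken from the menu in Figure 4.1.

```python
from dataclasses import dataclass

@dataclass
class MenuEntry:
    """One menu entry, described by what it is rather than how it looks."""
    name: str
    price: str
    spicy: bool  # "whether the dish is spicy"; a rating would need a level instead

def render_with_star(entry):
    """One possible presentation: a star in front of spicy dishes."""
    marker = "* " if entry.spicy else ""
    return f"{marker}{entry.name} {entry.price}"

def render_with_label(entry):
    """Another presentation of the same data: a text label after the name."""
    suffix = " (spicy)" if entry.spicy else ""
    return f"{entry.name}{suffix} {entry.price}"

# Hypothetical sample entry.
entry = MenuEntry(name="Kung Pao Chicken", price="8.50", spicy=True)
starred = render_with_star(entry)
labeled = render_with_label(entry)
```

Both renderings draw on the same recorded fact, so the star, the label, or a chili-pepper graphic can be swapped without touching the menu data.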

The following checklist can help team members uncover potential semantic components.

  • Discard Purely Presentational Items

    If a formatter can provide a piece of printed information automatically and the information or its presence changes depending on the chosen layout, it is probably presentational and not necessary to consider as a semantic component.

    Examples: page numbers and regularly appearing product logos.

    Typically, processing applications can automatically add presentational effects based on other markup that will be present. If you're unsure, you can retain the component until the late stages of modeling and then ask the application developers to review the situation.

  • Identify Single “Source” Components for Repeated Information

    If copy editors are currently checking laboriously for consistency of information when it is printed in multiple locations, the information probably needs a single component that records the information once, so that a formatter can retrieve it and output it in each location.

    Examples: all running text, such as headers or footers that repeat a chapter title, print the release date of a product, or display the document's security status. Also, tables of contents that repeat the titles and captions of blocks of information in the main body.

  • Identify Components for Retrieved Information

    If a piece of information can be retrieved from an existing or planned database rather than being typed in and checked by copy editors, the information most likely needs a content-based component.

    Examples: part numbers, trademarked terms and their owner-attribution notices, and glossary entries.

Recognizing Nested Containment

The notion of hierarchical containment is frequently a difficult one to understand, especially if many of the team members are familiar only with WYSIWYG word-processing systems. It can be helpful to practice finding nested organizations of components by circling them directly on a printed document.

For example, people with only WYSIWYG experience might at first view the menu text pictured in Figure 4.2, “Restaurant Menu with Flat Structure Identified” as a flat series of titles, various types of dishes, and descriptions, as indicated by the circles.

Figure 4.2. Restaurant Menu with Flat Structure Identified


For purposes of uncovering SGML requirements, a more useful way to view the same text would be to recognize the larger—and smaller—logical containers, as shown in Figure 4.3, “Restaurant Menu with Nested Structure Identified”.

Figure 4.3. Restaurant Menu with Nested Structure Identified


Often, for divisions of document content such as sections and chapters, the markup system that was used previously would call a section title a “heading” with a level number attached, often leading to the impression that a section is merely the section heading. It can be an eye-opener to show how the entire section can be considered a large container that associates all its content with the heading text.

Most word processors have no notion of “styles” that contain other “styles,” so the concept of containers with only other containers inside them can be a major stumbling block. Many SGML-aware editing products have capabilities that can illustrate the power of this kind of containment in dramatic ways—for example, collapsing and expanding containers at arbitrary levels, or simultaneously displaying both an expanded view and a collapsed view of the same document. Making such software available as part of the team training can be extremely effective, even if the software used for training will not be used in actual document creation.
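For team members who find it easier to think in terms of data than of circles on paper, the difference between the flat and nested views might be sketched as follows. This is a minimal Python illustration, not a modeling technique from this book: the labels, the sample dishes, and the helper functions nest and dishes_in are all hypothetical.

```python
# Flat view: a sequence of styled lines, much as a word processor records them.
# The labels and dishes are invented for this illustration.
flat_menu = [
    ("section title", "Soups"),
    ("dish name", "Hot and Sour Soup"),
    ("dish description", "Spicy broth with tofu"),
    ("section title", "Steamed Dishes"),
    ("dish name", "Steamed Dumplings"),
    ("dish description", "Served with ginger sauce"),
]

def nest(flat):
    """Group a flat sequence of labeled lines into sections that contain entries."""
    menu = []
    for label, text in flat:
        if label == "section title":
            # A section is a container that associates its entries with the title.
            menu.append({"title": text, "entries": []})
        elif label == "dish name":
            menu[-1]["entries"].append({"name": text, "description": None})
        elif label == "dish description":
            menu[-1]["entries"][-1]["description"] = text
    return menu

def dishes_in(menu, section_title):
    """List the dish names contained in the section with the given title."""
    return [
        entry["name"]
        for section in menu
        if section["title"] == section_title
        for entry in section["entries"]
    ]

nested_menu = nest(flat_menu)
soups = dishes_in(nested_menu, "Soups")
```

In the flat list, nothing says which dishes belong to which title except their position. In the nested form, a question such as “which dishes are soups?” can be answered directly from the containers.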

The following checklist can help team members uncover nested organizations of components.

  • Identify Labeled Components

    If you have found a descriptive title or label, consider whether there should be a component for the information that is described by the label.

    Examples: titles on document divisions, notes to readers, and figures, which imply the presence of components for the divisions, notes, and figures themselves.

  • Identify Container Components for Whole Series

    If “one or more” components can occur, consider whether there should be a component that contains them.

    Examples: almost anything. If you tend to describe the individual pieces as a “series,” “set,” “collection,” or “list,” it is almost certain that you need a containing component with that name.

  • Identify “Block” Components

    If several components appear in a group and have a significant order, consider whether, taken together, they represent a higher-level component.

    Examples: pairs of glossary terms and definitions, which may imply the presence of a glossary entry component.

Learning About Tag Abuse Syndrome

Authors and others in charge of marking up documents are prone to a condition that we call Tag Abuse Syndrome, which manifests itself in using and applying markup improperly. Often the cause is time pressure or lack of training, but a poor DTD design will exacerbate the problem. Either way, the result is a poor-quality information base that can't be exploited in the ways originally intended.

The usual symptoms of Tag Abuse are as follows:

  • Choosing markup solely for its formatting effect in a certain processing environment (for example, using a publication-date element for copyright information because the style sheet happens to output date information at the lower left corner of the title page, a location that an author might prefer for the copyright notice)

  • Failing to mark up information for which proper markup exists (for example, using a paragraph element that begins with the string “Note: ” rather than using a note element)

  • Using several different kinds of markup inconsistently because they produce the same formatting in a certain processing environment (for example, switching back and forth between emphasis and foreign-phrase elements to get simple italic text, even if the word isn't foreign)

  • Using multiple elements indiscriminately because the author doesn't understand the distinction (for example, using copyright and trademark elements as if they were the same)

The team members should understand how their work can have an effect on Tag Abuse. A DTD can encourage good markup practices by offering the right amount and depth of markup for the job, with only one way (where possible) to mark up any one kind of information. For now, the team members simply need to be aware of the need to justify the presence of all individual components that they propose in step 1, and to justify them in more depth when they select components in step 4, because excessive or duplicated markup can harm rather than help the documents.
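Some symptoms of Tag Abuse can be detected mechanically once documents exist, which may help the team judge how serious the problem is in a legacy collection. The following Python sketch looks for the “Note: ” symptom described above. It is only an illustration: the element names para and note, the sample text, and the function find_disguised_notes are assumptions, and a regular expression stands in for a real SGML parser.

```python
import re

def find_disguised_notes(markup):
    """Return the text of paragraphs that open with a literal 'Note:' label."""
    paragraphs = re.findall(r"<para>(.*?)</para>", markup, flags=re.DOTALL)
    return [p for p in paragraphs if re.match(r"\s*Note:\s", p)]

# Hypothetical sample: one disguised note, one proper note, one ordinary paragraph.
sample = (
    "<para>Note: Back up your files first.</para>"
    "<note><para>Unplug the unit before servicing.</para></note>"
    "<para>Notes are stored in the archive.</para>"
)

suspects = find_disguised_notes(sample)
```

A check like this finds only the symptom. Whether the cure is author training or a simpler markup model is still a question for the design team.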

4.1.3. Learning the Tree Diagram Notation

The tree diagram notation first shown in Chapter 2, Introduction to DTD Development, will be an important part of the analysis and modeling efforts. The team should be introduced to tree diagrams at this point, though they won't yet be doing actual modeling tasks that use the notation. (They might use it to document existing DTDs that bear on step 3 of the analysis work, however.)

This book contains many examples of tree diagrams, and Appendix B, Tree Diagram Reference, provides reference information on the notation.

4.1.4. Scoping the Work

If the design team hasn't yet been presented with the specific reasons for the SGML project, this presentation should be given now. The team needs to prepare an initial list of design principles, statements that will guide the team's decisions. If no recordist has been assigned yet, the team needs to choose someone to fulfill this duty. The assignment can rotate among team members.

The initial principles set the tone for the work of choosing potential components, and they also contribute to the building of derivative principles later on. Each design principle must be written down concisely and must be agreed to by the entire team. If agreement proves impossible, the team will fail in its mission to design a cohesive set of document type requirements because the goals will be ambiguous—and each team member will work towards only the set of goals he or she personally agrees with.

The following questions can help distill the project information into a few brief answers. It's a good idea to assign each principle a short identifying label, as shown here, so that it can be referred to conveniently during the analysis and modeling work and from within the document analysis report.

  • What documents fall under the scope of the SGML project?

    For example:

    Scope: The project covers user documentation and internal specifications for the computer systems we sell.

    The scope principle often gains additional detail during the analysis process, as people suggest marginal components. In this way, the team can avoid wasting time on potential components that are clearly outside the scope. For example, questions such as the following might arise: “We said documentation was included; are training materials included? How about quick reference cards?

    Note that the number of “document types” as such won't be determined until after the modeling phase is complete. For example, the scope principle won't state whether user guides, reference manuals, and product specifications are one document type or three.

  • What are the immediate and potential uses for the documents? What is the “least common denominator” utilization that would place the most constraints on the markup model? What are the ultimate goals for our use of SGML for these documents?

    For example:

    Document utilization: The manuals will be exchanged with business partners in draft form and will be prepared for both book (paper) and simple hypertext (CD-ROM and Internet) delivery. Eventually, we plan to generate sophisticated hypertexts from the same information.

    The document utilization principle will guide the team in determining where on the “content–structure–presentation” continuum the components must generally fall.

    It's important to clarify the boundaries for document utilization early on. Often, the “visionaries” on the team suggest uses of the document data that are so futuristic that they aren't yet supported by arguments for a return on the SGML investment. On the other hand, the “pragmatists” often feel uncomfortable with innovative uses for document data that they haven't yet seen with their own eyes. (A person can be a visionary about some topics and a pragmatist about others!)

  • What will the method(s) for document creation be? What is the “least common denominator” creation method that would place the most constraints on the markup model?

    For example:

    Document creation: Articles will be created and marked up with unstructured text editors such as vi. Whole journals will be assembled by the journal editor with the help of SGML-aware authoring software.

    The document creation principle shouldn't unnecessarily constrain the analysis and modeling work, but can be used to let the DTD implementor know about conditions that can affect variants or adapted versions of the reference DTD, as well as markup minimization requirements.

  • Who is in the audience for the DTD(s)? Who will use the DTD in document creation?

    For example:

    DTD audience: The DTD(s) we develop will be used by people in the Documentation and Marketing departments of our company, as well as those same departments of our business partners in the Widget Consortium. Also, our data entry subcontractor will use the DTD.

    The audience principle can affect the modeling decisions made later about how lax or enforcing the element content models are and can help the DTD implementor determine how to prepare the DTD for customization. The extent to which Tag Abuse is expected to be a problem might shape the audience principle or might suggest the creation of another principle that demands a certain level of proof for components.

  • What is the model for management of the documents?

    For example:

    Document management: Chapter-sized chunks will be stored in the database for assembly into whole manuals. Databases for legally required notices will be used in building books so that we can meet our legal obligations.

    The document management principle can contribute to the discussions about how content-based the document type design should be. It can also help the DTD implementor determine how many DTDs to build and what their architecture should be.

4.1.5. Planning to Prepare Deliverables

In every document type design team in which we have taken part, we have reached a point where we had to ask ourselves: “Why did we make this decision about this component?” Everyone on the team was a well-trained professional and each decision had been carefully weighed at the time it was made. However, a month later, no one could remember what the reasoning had been. We would then start the discussion all over again, and whether or not we came to the same decision, it was usually wasted time.

Therefore, you can imagine the situation several months or even years later, when all members of the design team have scattered and the original DTD implementor is gone. In this case, oral tradition is not very helpful. What you usually have is a pile of bug reports and enhancement requests from the current users of the DTD and nothing to compare them with, nor any rationale for why things were done as they were originally.

This is why it's crucial, as a first step, to record all suggestions and decisions along the way so that the team can see progress through the design work and avoid revisiting topics or decisions already discussed. This requirement forces everyone to clarify their ideas, and it helps all the team members understand quickly what the problem is about. Overall it saves the team a lot of time, and since everything is already in writing, it is easy to keep track of each contribution. (Section 3.4, “Handling Project Politics” suggests some ways to keep the design process efficient.)

It is no use trying to keep every document flawless before the design work is done; it would be too time-consuming and not at all cost-effective. But when it is finally over, you need to sort, clean up, and finalize the mass of documents you have accumulated by consolidating all the information into a document analysis report. This report will be used intensively by reviewers to validate the process and the decisions that were made, as well as by DTD maintainers and application developers.

The reviewers are not SGML experts, nor are they modeling experts. The report must therefore be understandable to them (for example, by avoiding technical SGML jargon), and be practical enough to relate to their document writing experience (for example, by using familiar terminology).

Because of the importance of terminology, the design team should keep a project glossary that defines all the special terms the members use. The glossary is a useful tool for the entire project group. For example, it should define SGML technical terms used in the report to help any nontechnical reviewers, and any jargon specific to the targeted information domain to help the DTD implementor (if that person is unfamiliar with the domain) and any reviewers who work outside that domain. It should also define any terms or phrases that the team chooses for labeling the concepts it invents during the course of the work.

For example, in work we've done on a computer documentation project, we found it appropriate to define, among others, the following terms:

  • “PCDATA” to help nontechnical reviewers understand the requirements that included #PCDATA specifications

  • “Volume” to help DTD implementors understand the way multiple manuals were typically grouped into sets

  • “Spike” and “chip,” terms we concocted to describe our complex interchange situation, to help all reviewers understand our use of these terms

Also, if you are part of an international corporation and need the requirements for the DTD to be reviewed abroad, make sure even the simplest words are well defined to avoid cross-cultural ambiguities. For instance, in the same project as the one described above, European reviewers had a hard time understanding what the Americans were calling an “example.” For the Europeans, an example was a fact or a thing illustrating a general rule. However, in American “computer English,” it was generally accepted that an example was a piece of program code. Of course, reviewers outside that community would have no way of knowing this meaning of the word unless it were documented.

After the report has been reviewed, there is often a pile of alterations and enhancements to be included in the requirement specifications. Make sure they are inserted in the report even if the implementor has started coding and you just intended to pass on the necessary changes orally. You need to write down the team decision in each case, with the rationale for it, especially if the team decides against the change. The request will probably come up again soon, and it is important to know why it was refused in the first place.

While the report is being reviewed, the reviewers often ask for more information or for explanations. If these comments are received while the design team is still holding meetings, make sure to record the questions and answer them in the final report. If the question was asked, however trivial, it means that some further explanation was needed.

DTD implementors and maintainers use the document analysis report in different ways than the reviewers do. The implementor will base all of the implementation work on this report, so what is important here is not the language or the reasons why you require something, but the completeness of the document and the precision of your requirement specifications. Make sure to include in your report all the additional information you give the implementor in response to verbal or written comments and questions, since obviously this information was missing and it is necessary for the maintenance phase.

Example 4.1, “Contents of a Typical Document Analysis Report” shows the typical contents of a final document analysis report with the needs analysis documents stored in an appendix as “background material.”

Example 4.1. Contents of a Typical Document Analysis Report

1. History of the Cookbook DTD Project
2. Design Principles
      Document Creation

3. Requirements for Cookbook Hierarchy Elements
4. Requirements for Recipe Elements
5. Requirements for Low-Level Elements

A. Tree Diagram Quick Reference
B. Needs Analysis
      Component Forms
      Component List/Matrix
      Element Forms
      Context Population Matrices
      Element Collection Forms
      Sample Structures

Project Glossary

4.1.6. Learning About Teamwork Norms

Successful teamwork is related both to using the steps of the methodology properly and to getting in the right frame of mind. Team members should heed the following advice.

  • Embrace Change

    In some cases it's appropriate to revisit decisions. The classification work in particular requires fluidity because, as the team's understanding of a concept grows, boundaries can shift.

  • Defer with Confidence

    It's tempting to start doing modeling work in the analysis phase. However, the purpose of analysis is to gather data, rather than to decide on any structural requirements or to make decisions about whether something is an element or an attribute. If the answers are obvious, it means only that the final decisions can be made quickly when the team is in the modeling phase.

    By waiting, you gain a more complete understanding of similar components that should have similar treatment. In this way, you can avoid many “religious” arguments about design. In addition, it's more efficient to collect all the needs before changing over to the modeling work.

  • Record As You Go

    The team is wasting time and money if people can't remember exactly what was decided previously or why they decided it. The team should get into the habit of recording all decisions and rationales, even (perhaps especially!) the obvious ones.

  • Remember the Power of Names

    As the team develops its own unique jargon for the concepts that come up in discussion, it's important to choose names for them that are intuitive and that don't have too much “baggage.” For example, if you call a section a “head,” team members and reviewers might tend to misunderstand and think this term applies to the title or heading of the section, rather than the section itself.

    Often, people from different departments or companies use different words for the same concept. Sometimes the team may need to choose terms that are halfway between one usage and another in order to avoid superficial disagreements that prevent the process from moving ahead. For example, in computer circles, some people can nearly come to blows over whether a certain kind of computer language construct is a “parameter,” an “argument,” or a “qualifier.” If all are agreed that the construct under discussion is the same thing, it might be best temporarily to pick a fourth name unrelated to the three choices, define it in the glossary, and move on.

    The process of naming should always be followed by making an entry in the project glossary so that reviewers will understand the team's reports.

  • Avoid Writing the DTD

    In both the analysis and the modeling phases, it can be tempting for people who can write element declarations to start doing so. However, it's crucial that work done by the team be understandable by all team members and all reviewers, not just by people qualified to be DTD implementors. Section 2.2, “SGML Information Modeling Tools and Formalisms” offered several reasons why the modeling work should be expressed in ways other than SGML markup declarations. However, the most important reason, as far as teamwork is concerned, is that using DTD code is unfair to the nontechnical people on the project.

    Further, a DTD is much more than a markup model. If any prototype declarations are written, it should be understood that they won't be used directly in the final product, because many other factors must be taken into account in constructing the DTD.

  • Respect All Contributions

    Expert systems that encapsulate artificial intelligence are built in much the same way as document type design requirements—extracting the knowledge of human experts with real-life experience. The cognitive scientists who build these systems have learned that the extraction process can become highly emotional and personal, because the experts have invested a great deal of themselves in their chosen field. So it is with document type analysis and modeling.

    It's important to remember that all the team members were asked to participate on the team for the unique contributions they each can make. Listen carefully to the contributions of the others and respect their right to participate, even if you disagree with what they are saying.

  • Be Systematic

    If the DTD project has been properly launched, the steps, processes, report deliveries, and review cycles have been planned and documented. Don't discard the project framework once you start doing the actual design work; sticking to the workflow roadmap will give you an advantage in completing high-quality work on schedule.

4.1.7. Gathering Analysis Input

If the information that will serve as input to the analysis hasn't already been supplied, the team members need to help gather it. After a discussion about the kinds of input that would be useful, each person should take on a “homework assignment” to locate one or more particular sources of information, become familiar with them before group-based analysis begins, and make copies of them for other team members if appropriate.

The following checklist suggests possible sources of analysis input. The team will need to determine the importance of each source to the project goals.

  • Document Samples

    Unless the scope of the project describes a small, closed set of documents (such as a set of ten ancient literary texts), it's impractical to analyze all the documents in the scope. If there are many other sources of input, the team can choose fewer sample documents, but if there are few or no other sources, the team needs to analyze as large and diverse a set as time constraints allow. If no documents exist yet, collect the plans or storyboards for their creation and design.

  • Specifications and Documentation for Existing Markup Systems

    This might include information on the markup systems currently in use, any DTDs on which the DTD resulting from the design work must be based, and any DTDs already available for the scope covered by the project.

    For example, authors might currently use a set of Word for Windows styles or TeX macros, or even other DTDs. These sources of information shouldn't be overlooked; they often provide the most interesting insights into subtle document type requirements.

  • Specifications and Documentation for Existing DTDs

    If your project involves customizing or revising an existing DTD, your work will have several constraints that don't normally exist and the design process will differ from a project to develop a new DTD. Chapter 7, Design Under Special Constraints describes how to incorporate DTD analysis into your work.

    If you don't have a requirement to use a certain DTD as a starting point, be careful not to base all the design work on an existing DTD simply because it exists. Often, real needs can be overlooked in the rush to save time and money by using an existing DTD. This is why we suggest that existing DTDs and analyses be given a thorough look only after the initial gathering of requirements. Section 4.2.3, “Step 3: Validating the Needs Against Similar Analyses” provides more information.

  • Contractual or Legal Standards for Document Delivery or Interchange

    For example, U.S. Securities and Exchange Commission filing statements must follow certain rules when filed electronically. These rules should have a priority in the analysis of a document type for filing statements, and may even need to be examined before the project launch to ensure that they are feasible and that different sets of requirements don't conflict with each other.

  • Formatting and Editorial Style Guidelines

    Formatting guidelines can be analyzed to reveal many underlying structural or content-based guidelines. For example, a style guide that specifies that the first division in a manual must be called “Preface” and must not be numbered implies that prefaces should be considered as a separate semantic component.

  • Product and Document Usability Studies and Bug Reports

    These sources indicate how people actually use the documents, which can help identify where opportunities exist for more precision in markup, such that customers can locate and access information in a more thorough or helpful way.

    Of course, bug reports for the document production system itself are particularly useful for revisions of existing DTDs. Section 11.1, “Setting Up and Managing a Bug-Reporting System” discusses the tracking of the production system in more detail.

Often, the best sources of input are more intangible than actual documents you can pick up and look at. The experience of the team members themselves, combined with an in-depth understanding of the desired ways in which the information will be put to use, is often the best source of all. First, for every document that can be examined under the constraints of an analysis session, each person working in the field has probably come across dozens or hundreds of others, whose characteristics they can introduce into the discussion. Second, any ideas for utilizations that were impossible in the existing processing environment won't easily come to light through examining existing documents.

Later in this chapter, we discuss techniques for eliciting wholly “creative” requirements, the evidence for which can't be found in a pile of existing documents.

4.2. Performing the Needs Analysis

Now it's time for the design team to get to work. Section 4.2.1, “Step 1: Identifying Potential Components” discusses the first step, identifying potential components. Section 4.2.2, “Step 2: Classifying Components” describes how to classify them. Section 4.2.3, “Step 3: Validating the Needs Against Similar Analyses” discusses validating the list of components against other work.

We've tried to make the steps as clear as possible and to provide simple examples, but keep in mind that the creativity and ambiguity of all but the most trivial document text can complicate analysis. You should plan on tolerating some indecision, tentativeness, and iteration in the process.

Depending on the number, complexity, and familiarity of samples and other input sources and the SGML modeling experience of the team, analysis can take from one or two staff-days to several staff-weeks. If the team members feel comfortable splitting up some of the analysis work among themselves and performing it individually or in pairs before or between team meetings, the work can proceed more efficiently. In fact, it can be useful to break into pairs containing one person with more experience in the documents they've been assigned and one with less, since each sees things in a different light. However, the analysis results will still have to be correlated in a group setting.

To keep track of all the information collected about semantic components, the recordist needs to keep an online or paper-based record of each component. These component forms will later be assembled into the needs analysis report and, when all their fields are complete, will provide a historical record of the analysis in the final report.

Component forms have several fields, most of which are filled during analysis. Figure 4.4, “Semantic Component Form” shows a blank sample of a component form.

Figure 4.4. Semantic Component Form

The following explanation of the fields also serves as a map for the analysis work. In the following sections we'll show you examples of how the fields might be filled in.

Field (When Filled In): Description

Component name (Step 1):

    A short, descriptive name for the component. It's important that this name be intuitive and that it have as few confusing associations as possible. If you choose an unusual name, define it in the project glossary.

    Avoid determining your list of potential components by working from a single existing markup language. Doing this may result in a set of components that reflects too many of the faults of the existing language.

Number (When the component is stable):

    A unique number for referring to this component during analysis and specification. If you are working on paper, keep in mind that any numbering system you supply will periodically be interrupted by new entries (or by reorganization, if you organize the forms according to their provisional classes). In this case, you may want to wait until step 4 to assign numbers.

Definition (Step 1):

    A clear, brief definition of the meaning or role that the component serves. When a component is proposed, the team should define it before doing any further analytic work on it. Reaching agreement on the definition is crucial. Even seemingly trivial components, such as “paragraph,” can provoke interesting and controversial discussions about their meaning.

Classes (Step 2):

    One or more key phrases placing the component in appropriate logical classes and subclasses. As the definition work moves to classification work, increasingly detailed classes of components will emerge. As you feel you understand a component sufficiently, record the team's current views on its classification. If you're working on paper, keep an eraser handy!

Explanation and examples (Steps 1 and 2):

    Supporting documentation that proves the component was found in the analyzed data. Provide further explanation or definition here as necessary, and reproduce or point to examples of cases where the component is used. If a proposed component is new and doesn't exist in current documents, the purpose is to explain what this new component could have been used for.

    If you're able to describe examples of how the component's inner structure or outer context is typically structured, do so here, with outlines or sketches.

Existing markup (Steps 1 and 2):

    Equivalent forms of this component in any related markup languages that are under examination. For example, if the documents in the scope are currently produced with a set of word processor styles, each component should list any styles that are intended to represent or format that component's information. This information is helpful in ensuring that the analysis is complete, and it can also help in specifying the requirements for conversion and autotagging software.

Accepted (Modeling phase, step 4):

    A Yes/No indication of whether the component has been accepted as a “need.”

    Why record so much information about components that may not be accepted? First, if you haven't rejected a potential component out of hand by the time you've defined it, the likelihood is high that it, or a similar or related component, will make it into the final requirements. Second, you can't accept or reject a proposed component until you understand it thoroughly enough to recognize the costs and benefits of doing one or the other. Finally, information about components that have been found to be marginally outside the scope can be fed into later, more ambitious projects.

Rationale (Modeling phase, step 4):

    All the reasons supporting the decision to accept or reject the need for the component. The rationale should refer to specific design principles and project goals in support of the decision. The process of selecting components often spurs the need to clarify existing design principles or add new ones.

SGML markup/related elements (Modeling phase, step 10):

    A brief description of the markup with which an accepted component was ultimately distinguished, for example, the name of the element, attribute, or attribute value, and a description of the unique contexts that match this component's purpose. Filling in this field signals how the need was met and can help developers of processing applications understand how to access information correctly. When the numbers of the relevant element forms (described in Chapter 5, Document Type Modeling and Specification) are known, they should be filled in here.

Creation/change history (Maintenance phase):

    The date the form was first filled in, and the dates and brief descriptions of subsequent changes and additions.
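For a team that keeps its component forms online rather than on paper, the fields described above could be held in a simple record. The following is only a minimal sketch in Python, not part of the methodology itself: the class name, the attribute names, the rule that a decision must carry a rationale, and the date used in the example are all our own assumptions for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class ComponentForm:
    """One semantic component form (hypothetical layout mirroring the form's fields)."""
    name: str                                   # filled in during step 1
    definition: str                             # filled in during step 1
    number: Optional[int] = None                # assigned once the component is stable
    classes: List[str] = field(default_factory=list)          # step 2
    explanation: List[str] = field(default_factory=list)      # steps 1 and 2
    existing_markup: List[str] = field(default_factory=list)  # steps 1 and 2
    accepted: Optional[bool] = None             # modeling phase; None means undecided
    rationale: str = ""                         # modeling phase
    sgml_markup: str = ""                       # modeling phase
    history: List[Tuple[date, str]] = field(default_factory=list)

    def decide(self, accepted: bool, rationale: str, when: date) -> None:
        """Record an accept/reject decision together with its rationale."""
        if not rationale.strip():
            raise ValueError("record the rationale along with the decision")
        self.accepted = accepted
        self.rationale = rationale
        self.history.append((when, "accepted" if accepted else "rejected"))


# Example use; the rationale text and the date are invented for the illustration.
form = ComponentForm(name="Recipe title",
                     definition="The official label for the recipe")
form.existing_markup.append(".TI")
form.decide(True, "Needed for retrieval and display", date(2024, 1, 15))
```

Leaving the number and the accepted flag empty until they are known matches the way the form is filled in gradually over the analysis and modeling phases.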

Throughout the following sections describing the analysis steps, we'll use, as an example, a hypothetical project to develop a cookbook DTD. The imaginary company CookThis, Inc. currently produces packets of index cards with recipes printed on them that it sells through telemarketing. It wants to expand its offerings to printed cookbooks and electronic recipe hypertexts.

For this project, the main analysis data comes from the following sources:

  • Several dozen recipes marked up in a simple troff-like markup language used by CookThis, several of which we'll reproduce here. A processor that interprets the markup produces PostScript™-formatted series of index cards.

  • Editorial and internationalization guidelines for printed cookbooks, written by the CookThis editing staff in preparation for the migration.

  • Usability studies on electronic recipe query sessions by users, the results of which you can see at the end of Section 1.4, “SGML Document Processing”.

The CookThis document type design team has agreed on the following initial design principles:

  • Scope: The scope includes individually published recipes for food dishes, along with cookbooks whose main content is organized sets of these recipes.

  • Document utilization: The recipes will continue to be printed on individual cards and assembled into packets. The whole cookbooks will be printed and will also be made available as searchable hypertexts as part of a larger CD-ROM product to which we are contributing.

  • Document creation: Contributing recipe authors will send us diskettes containing recipes formatted with word processors, or hardcopy recipes that must be scanned or typed in. In-house recipe authors will use the SGML-aware editor we've acquired. Each cookbook editor will contract for any necessary conversion.

  • DTD audience: The DTD(s) will be used directly by in-house recipe authors and cookbook editors.

  • Document management: The recipes will be stored individually in our database, and will be assembled into different cookbooks based on their characteristics. The framework text for each cookbook will also be stored individually in the database.

Further, the project goals suggest using DTD fragments that are already supported in a wide variety of authoring and processing software, with the anticipation that an existing table model can already be used. Thus, with the advice of the DTD implementor, the team has added information on CALS tables to their analysis input, to be examined closely in step 3.

In this chapter and in Chapter 5, Document Type Modeling and Specification, each step will be followed with a summary of the most important points to remember in performing that step.

4.2.1. Step 1: Identifying Potential Components

Some potential recipe components can be identified easily through simple observation; others need more imagination, and are often suggested by the project's anticipated document utilizations. Here, we'll examine several recipes and other data and identify some (though certainly not all) potential components in them, beginning to fill out the component forms in the process.

We've picked the smallest practical example that still has the richness of a typical DTD development project, and yet, as you'll see, the number of potential components can be nearly overwhelming.

The simple markup language used to produce these recipes is called rcard; it is documented in Table 4.1, “rcard Markup Language Documentation”. Note that rcard is not a contextual language, so the markup can appear in any order and configuration according to common sense. However, the language is relatively declarative, and obviously identifies some important content-based characteristics of recipes.

Table 4.1. rcard Markup Language Documentation

Markup                                    Description
.TI title                                 Title of recipe
.SO source                                Source of recipe
.IN                                       Start of ingredient list
Line of text inside ingredient markup     Item in ingredient list
.EN or .TE                                End of ingredient list
.NO [text]                                Note title (the default title text is “Notes:”)
^                                         Typeset degree symbol (for example, 350^F would produce “350°F”)
[n-]num/denom                             Fraction (for example, 1-1/2 would produce “1½”)
Lines of text outside ingredient markup   Paragraph of text; a blank line signifies the start of a new paragraph or other construct
#                                         Start of comment line to be ignored in the output

For example, the markup and content for the cookie recipe shown later in Figure 4.6, “Sample Cookie Recipe” is as follows.

.TI Meringue Cookies
1/8 tsp salt (3 shakes)
3 egg whites
3/4 (generous) cup sugar
1 tsp vanilla
1 pkg chocolate chips, coconut, or whatever (optional)
Mix salt, egg whites; beat until stiff peaks form.
Slowly beat in sugar; you'll feel the mixture getting stiff.
Fold in vanilla (and chocolate chips).
Preheat the oven to 300^F.
Drop cookies onto greased, floured sheet.
Bake for 30 minutes.
.NO Other uses:
Cookies are good crumbled on top of ice cream.
.NO Variations:
Use rum instead of vanilla!
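To show how little machinery the rcard conventions in Table 4.1 require, here is a rough sketch of a line classifier in Python. It is not the CookThis processor. The function names, the decision to join consecutive text lines into one paragraph, and the short table of fraction glyphs are our assumptions, and the sample input is made up for the illustration.

```python
import re

# Only a handful of fraction glyphs are mapped; other fractions are left as typed.
FRACTION_GLYPHS = {"1/2": "½", "1/4": "¼", "3/4": "¾", "1/8": "⅛"}
FRACTION_RE = re.compile(r"(?:(\d+)-)?(\d+/\d+)")


def typeset(text):
    """Apply the inline conventions: ^ for degrees, [n-]num/denom for fractions."""
    def to_glyph(match):
        whole, part = match.group(1), match.group(2)
        if part not in FRACTION_GLYPHS:
            return match.group(0)
        return (whole or "") + FRACTION_GLYPHS[part]
    return FRACTION_RE.sub(to_glyph, text).replace("^", "°")


def classify(lines):
    """Return (kind, text) pairs for the constructs found in rcard source lines."""
    constructs = []
    paragraph = []
    in_ingredients = False

    def flush():
        # Close the paragraph being collected, if there is one.
        if paragraph:
            constructs.append(("paragraph", typeset(" ".join(paragraph))))
            paragraph.clear()

    for raw in lines:
        line = raw.strip()
        if line.startswith("#"):
            continue  # comment line, ignored in the output
        keyword, _, rest = line.partition(" ")
        if keyword in (".TI", ".SO", ".NO", ".IN", ".EN", ".TE"):
            flush()
            if keyword == ".TI":
                constructs.append(("title", rest))
            elif keyword == ".SO":
                constructs.append(("source", rest))
            elif keyword == ".NO":
                constructs.append(("note title", rest or "Notes:"))
            else:
                in_ingredients = keyword == ".IN"
        elif not line:
            flush()  # a blank line starts a new paragraph or other construct
        elif in_ingredients:
            constructs.append(("ingredient", typeset(line)))
        else:
            paragraph.append(line)
    flush()
    return constructs


# Hypothetical input, not one of the CookThis recipes.
sample = [
    ".TI Test Cookies",
    "# a comment line",
    ".IN",
    "1-1/2 cup sugar",
    ".EN",
    "Preheat the oven to 300^F.",
    "Bake for 30 minutes.",
    ".NO",
]
result = classify(sample)
```

Because rcard is not contextual, the sketch accepts the markup in any order and does no validation. Notice also that nothing in the output distinguishes an amount from the ingredient it belongs to, which is exactly the kind of gap the analysis is meant to uncover.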

Figure 4.5, “Sample Chicken Recipe” shows a recipe for chicken in its printed form.

Figure 4.5. Sample Chicken Recipe

The following components might be identified so far.

Component name, followed by its definition and its explanation and examples:

Recipe
    Definition: A complete description of how to prepare a dish, with food ingredients and instructions supplied
    Explanation and examples: The general structure of a recipe appears to be: ingredients first, then instructions.

Recipe title
    Definition: The official label for the recipe
    Explanation and examples: Every sample recipe has a title. Titles can be fanciful (as in “Debbie's Chicken”).

Source of recipe
    Definition: The person or institution contributing the recipe

Ingredient list
    Definition: A collection of information about the food substances needed for the recipe
    Explanation and examples: The general structure of an ingredient list is one or more ingredients, listed in approximately the order they'll be needed in the preparation of the dish. Some ingredients can be further subgrouped.

    Note that in this recipe, many optional ingredients are mentioned in the text that are not listed in the “official” ingredient list at the top. The way this recipe has been organized and written may later be a problem for full document utilization, even if an ideal DTD were constructed to hold ingredient information.

Ingredient sublist
    Definition: A subgrouping of ingredients
    Explanation and examples: Not all the ingredients in this recipe need subgrouping.

Ingredient
    Definition: A food substance required for the recipe
    Explanation and examples: Each ingredient is listed on a separate line. In this recipe, ingredients don't have amounts. Sometimes a description is provided to give context to the ingredient; for example, this recipe lists oil “for browning.” Also, some “ingredients” are actually a mixture of individual food substances that must be combined as part of the process.

Number
    Definition: A part of an ingredient's amount
    Explanation and examples: This recipe doesn't mention amounts in ingredients, but the instruction text does contain some more precise amounts, with numbers.

Instruction list
    Definition: Collection of directions for producing a dish
    Explanation and examples: In this recipe, the instructions are an ordered series of blocks of text, each constituting one major step.

Instruction step
    Definition: Single major phase in the preparation of a recipe
    Explanation and examples: Each step is made of a block of text similar to a paragraph.

Unit of measurement
    Definition: A type of amount
    Explanation and examples: This recipe mentions tablespoons (Tbs) of oil.

Recipe type
    Definition: A general category into which the dish falls, typically an indication of the course it is served in
    Explanation and examples: This information isn't recorded in the recipe, but this dish falls into one or more categories: meat dishes, chicken dishes, main dishes, and so on. Perhaps, for our retrieval plans, we might consider recording each dish's “nationality” too.

Preparation time
    Definition: Amount of time the recipe needs for preparation
    Explanation and examples: This information isn't recorded in the recipe, but the usability studies suggested that busy cooks want to know this information.

Yield
    Definition: The number of pieces or servings the recipe produces
    Explanation and examples: As for preparation time, this may be useful information to supply.

Recipe cross-reference
    Definition: Mechanism to help the reader find a recipe related to the current one
    Explanation and examples: This recipe mentions chicken broth, and we have a recipe for homemade chicken broth, which suggests a link between the mention and the recipe would be helpful.

Figure 4.6, “Sample Cookie Recipe” shows a recipe for cookies.

Figure 4.6. Sample Cookie Recipe

The following additions and changes might be made to the component forms.

Component name, followed by its definition (where one is added) and its explanation and examples:

Source of recipe
    Explanation and examples: Not every recipe has a source.

Ingredient sublist
    Explanation and examples: Not every ingredient list uses sublists.

Amount
    Explanation and examples: Amounts appear to be used in the contexts of both ingredient components (for example, “3 eggs”) and instruction steps (for example, “bake for 30 minutes”). Not all ingredients have amounts listed, or at least amounts that are precise enough to be measured. One ingredient even has the notation “(generous)” in it.

Number
    Explanation and examples: Note that the rcard markup has a special feature for producing fractional numbers, which are common in ingredient amounts. This recipe has some fractions.

Degrees
    Definition: The heat setting of an oven for baking
    Explanation and examples: Note that the rcard markup has a special feature for producing the degree symbol for oven settings. This recipe needs this symbol. But is it an acceptable presentational component, or is it simply another unit of measurement?

Unit of measurement
    Explanation and examples: This recipe adds cups, packages, degrees Fahrenheit, and minutes to the kinds of units of measurement.

Ingredient optionality
    Definition: Whether or not an ingredient is required in preparing the dish
    Explanation and examples: The last ingredient line has the text “(optional)” at the end, suggesting that this ingredient doesn't need to be added. (Looking back, the instruction text in Debbie's Chicken also suggests that some ingredients not yet shown in the list are optional, so there's strong evidence that this component exists.)

Ingredient alternative
    Definition: A suggestion for equivalent ingredients, or ingredients that achieve a different effect
    Explanation and examples: This component may be related to ingredient optionality. Meringue Cookies allows chocolate chips, coconut, or a substance of the cook's choosing. Also, the recipe variation supplied in the same recipe suggests another simple ingredient substitution.

Instruction step
    Explanation and examples: This recipe seems to have only one major step. If it had been written differently, it could appear to have more steps, which perhaps would be more appropriate.

Note
    Definition: A cautionary admonition or suggestion for serving the dish
    Explanation and examples: The rcard markup suggests this component. This recipe has two notes. The general structure of a note seems to be a title, followed by a block of text. (So far it's unclear whether notes are part of a recipe as a whole or part of the recipe instructions specifically.)

Note title
    Definition: A descriptive label for a recipe note
    Explanation and examples: The Meringue Cookies recipe has two different notes, neither of which uses the default note title set up by the rcard markup. Do these titles suggest that instead of structural notes, these should be content-based “serving suggestion” and “variation” containers?

Instruction optionality
    Definition: Whether or not a task is required in preparing the dish
    Explanation and examples: Part of an instruction in this recipe is in parentheses and corresponds to the optional ingredient, suggesting that the instructions are textually “conditionalized” for the two circumstances.

Figure 4.7, “Sample Cake Recipe” shows a recipe for chocolate cake.

Figure 4.7. Sample Cake Recipe


The following additions and changes might be made to the component forms.

Component Name | Definition | Explanation and Examples
Ingredient sublist | | Some sublists have titles.
Ingredient sublist title | Short description of a collection of some of the ingredients needed for a recipe | This recipe identifies the ingredients for the frosting separately, though the process for making the frosting is simply part of the whole cake-making process.
Instruction title | Label on an instruction to explain its purpose | The step for making icing is labeled in this way. It could be evidence of a “subrecipe” component, but we have chosen to interpret it this way.
Trademarked term | A term “owned” by someone or some organization | One of the ingredients in this recipe mentions the trade name of a kind of chocolate. We need to track uses of trademarks so that we can give proper attribution to the owner and add trademark symbols as appropriate.
Degrees | | This recipe uses an absolute heat setting from 1 to 10. Perhaps this component should be broadened, so that users of the hypertext version can switch between absolute settings and Fahrenheit and Celsius measurements, depending on how their ovens are marked.
Unit of measurement | | This recipe adds absolute heat settings and grams to the kinds of units of measurement.

In addition, while we don't show the analysis input for it here, following are some of the potential components that can be identified for the overall structure and contents of entire cookbooks.

Component Name | Definition | Explanation and Examples
Cookbook | A packaged collection of recipes and related information for sale as a unit | The general structure we expect cookbooks to have is: preface (if any), then acknowledgments section (if any), then several recipe sets, then appendices (if any).
Cookbook descriptive information | The major identifying information about the cookbook, usually found on the title page. | All the printed cookbooks will have title pages. But will our hypertext version have them? Therefore, let's not call it a “title page.”
Cookbook title | The official label for the cookbook | For example, “Dishes from the Far East.”
Editor(s) | The one or more people who were responsible for compiling or writing the cookbook's contents | Every cookbook has at least one editor assigned to it.
Legal information | The copyright notice and other legally required notices, usually on the copyright page that backs up the title page. | All the printed cookbooks will have copyright pages. But will our hypertext version have them?
Cookbook copyright | Information about the holder of the copyright | We own the copyright on the entire publication. Copyright information includes the owner and the year(s) the work is copyrighted.
Individual recipe copyright | Information about the copyrights held on individual recipes and the publisher's permission to use them | We sometimes reproduce recipes from other publications, with permission. Is this information part of the cookbook, or the individual recipe?
Trademarked term attribution | Acknowledgment of the owner of a term or phrase used in the text | Sometimes we mention brand names of cooking implements or prepared food, which names we must credit to their owners. For example, Nestlé is a registered trademark of the Nestlé Food Company.
Preface | An introduction to the cookbook as a whole and how to use it effectively | We plan to supply a preface for all our cookbooks.
Acknowledgments section | Text thanking recipe contributors and other assistants in the creation of the cookbook | We will probably supply an acknowledgments section for most of our cookbooks.
Recipe set | A major division of the cookbook containing closely related recipes | We may or may not number the sets. The general structure of a set is: an introduction, then the recipes, possibly with a preceding small introduction on each.
Recipe set title | Descriptive label for a recipe set | For example, “Chicken Dishes.”
Recipe set introduction | Paragraphs and other material describing the nature of a recipe set | Every recipe set will have an introduction explaining the relatedness of the recipes.
Individual recipe introduction | Paragraphs and other material describing the nature of a recipe | Some recipes will be introduced specially.
Paragraph | A block of prose text | General-purpose paragraphs are different from instruction steps and so on, in that they can contain any kind of information.
List | Collection of related small pieces of information | In introductions and so on, lists (for example, of cooking implements that should be on hand) might appear. These are different from ingredient lists and instruction lists in that the list can contain any kind of information.
List item | One small piece of information in a collection, made of a block of text | These might be, for example, the individual cooking implements.
Illustration | A line drawing of a concept or cooking utensil | For example, we intend to illustrate julienne technique with drawings.
Photograph | A real-life picture of a finished dish or ingredient | The recipe sets may show a sampling of dishes produced by the recipes in the set.
Caption | A brief explanation of what's going on in the illustration or photograph | For example, “Old-Fashioned Flour Sifter.” We're not sure if all illustrations and photographs will be captioned. Some may be relatively “incidental” to the text.
Table | An arrangement of bits of information along two dimensions | For example, tables containing measurement equivalents in different systems might go in an appendix.
Appendix | A collection of information supplementary to a recipe set or the cookbook as a whole | For example, a cookbook might have an appendix for measurement equivalents or information on companies where certain cooking implements and ingredients can be found.
Appendix title | The descriptive label for an appendix | For example, “Sources of Exotic Ingredients.”

Following is a checklist of analysis tasks that shouldn't be overlooked, with comments on how they relate to the work done to date on the CookThis DTD project.

  • Set Limits

    With highly complex structures, it can be tempting to analyze the data to the ultimate degree, and it's hard to know where to stop. For example, ingredient measurements can probably be made as complex as the energy level of the team allows. You might want to set a time limit for such discussions in the analysis phase, and let the issue mature naturally as you move to the modeling phase.

    You should also assess, before spending a great deal of time on analysis, whether the information is directly related to your information domain and how much tool support will be available for processing it. For example, it is of more value to analyze “measurement equivalence tables” than tables in general. The facilitator and DTD implementor can help team members do this assessment, and in step 3 can provide information on any standard DTD fragments (such as for tables, mathematical equations, and electronic review) that might apply to the information.

    In the CookThis scenario, the team has determined that its focus isn't on sophisticated uses of tabular information, but it could have been otherwise. For example, if the hypertext cookbooks were supposed to offer a facility for interactive computation of measurement equivalents, it would have made sense to have special markup for equivalence tables.

  • Be Open to Component Ideas

    Component ideas can come from anywhere. For example, the ingredient optionality component was generated from the observation that certain text seems especially significant.

    Some components may suggest other obvious components. For example, the cookbook copyright information contains “owner” and “year” information—possibly two new components. And the trademarked term component may have been what suggested the idea of a trademark attribution component.

    Any respectable component idea that arises in discussion should be given its day in court. For example, if the team can imagine that it's useful to put exact ingredient amounts in boldface type to distinguish them from approximate amounts, it's worthwhile to add the measurement precision component for consideration. (Alternatively, the team may decide to cast the two components in terms of an exact amount and an approximate amount.)

  • Recognize Cloned Components

    Some components might be better off collapsed into a single entry early on. All components should have unique names, so the repeated components should be either collapsed or distinguished by the end of step 1. In addition, any components that overlap each other in purpose or meaning should be sorted out in step 1. For example, an earlier version of the recipe components might have had “items” instead of “steps” as the contents of the instruction list, with the team later renaming them to distinguish them from general-purpose list items.

    If you have some doubt about how much distinction exists between components, don't collapse them; they may simply be good targets for creating functional classes in step 2. The component names that have been chosen may hint in this direction. For example, a “figure caption” may seem similar to a recipe title, but the fact that it is known by a different name suggests that there may be differences in structure or usage.

  • Abstract Away from Presentation

    Some component ideas might appear to be quite presentational in nature. Rather than dismiss these components out of hand, when you define them be sensitive to ways to make them more abstract. The components for cookbook descriptive information and legal information would have obscured useful ways of using the data if they had been thought of as “pages.” For example, although a printed cookbook's title page might display the cookbook title, other places, such as the readers' comments form in the back, might reproduce the same information. The title isn't just a “title page” phenomenon.

  • Document What You Find

    The component explanations and examples should be as thorough as possible, since they can provide useful information for many parts of the DTD development process. For example, the description of the ingredient component describes some of the supplementary information (the purpose of the ingredient in the recipe) that might be present in what otherwise might seem to be a simple ingredient specification. This bit of background can be helpful in further analysis and in the DTD documentation.

    For components with obvious “content models,” provide descriptions of how their content is likely to be structured. This information will be fed into the modeling work. For example, recipes and cookbooks both have notes on the structure of their content. Likewise, it's helpful to record questions about the context of a component, as is done for notes.

  • Acknowledge Fuzzy Boundaries

    Differences between writing styles can be problematic as you try to fit semantic components into neat packages. For example, the explanations of the instruction-related components show that it's hard to identify the boundaries between one step and another, and it's even hard to tell how “major” or “minor” a step is. There's plenty of opportunity to ponder these issues before an answer must be forced in the modeling phase, but if a certain direction becomes clear during the analysis discussions, it should be recorded and agreed to. For example, if you can easily agree that any two wholly sequential tasks must be marked up as separate major steps, record it in the component's definition. Often, new design principles will arise out of these discussions.

  • Go Beyond Printed Samples

    Once the obvious components have been identified, it is useful to have brainstorming sessions to identify potential components that can help meet the project's more sophisticated goals for document utilization. This process is sometimes called “enriching the data” because it involves considering markup for information that authors haven't yet supplied. Some of the component explanations hint at the possibilities. For example, the recipe type component might suggest to the team that knowing whether a dish is vegetarian or not is useful. Should this component be considered? What about each dish's “nationality” or origin (and how does this differ from the “source” information)? Should precise nutritional information per serving be supplied for each recipe?

In order to move to step 2, the team should feel that it has extracted all the reasonable candidate components it can think of, and it should have completed at least the following component form fields. However, component forms can still be added at any time, and field values can be modified or expanded on.

  • Component name

  • Definition

  • Explanation and examples

  • Existing markup

Further, any phrases coined by the design team and any specialized uses of terms should be defined in the project glossary.
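
The exit criteria for step 1 can be sketched as a small check, shown here in Python. This is only an illustration, assuming the component forms are kept as simple records; the class `ComponentForm`, its attribute names, and the function `step1_problems` are hypothetical names invented for this sketch, not part of the CookThis project materials. It flags repeated component names (which should be collapsed or distinguished) and step 1 fields that are still empty.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ComponentForm:
    """One component form; the attribute names are illustrative assumptions."""
    name: str
    definition: str = ""
    explanation: str = ""
    existing_markup: str = ""
    classes: tuple = ()  # filled in later, during step 2


# The four fields the team should have completed before moving to step 2.
STEP1_FIELDS = ("name", "definition", "explanation", "existing_markup")


def step1_problems(forms):
    """Return human-readable problems for the team to review."""
    problems = []
    # Component names must be unique, ignoring case and stray spaces.
    counts = Counter(form.name.strip().lower() for form in forms)
    for name, count in sorted(counts.items()):
        if count > 1:
            problems.append(f"duplicate component name: {name} ({count} forms)")
    # Every step 1 field should have some content.
    for form in forms:
        for attr in STEP1_FIELDS:
            if not getattr(form, attr).strip():
                problems.append(f"{form.name}: field '{attr}' is empty")
    return problems


forms = [
    ComponentForm(
        name="Cookbook title",
        definition="The official label for the cookbook",
        explanation='For example, "Dishes from the Far East."',
        existing_markup="(none)",  # placeholder value, not from the project
    ),
    ComponentForm(
        name="Note title",
        definition="A descriptive label for a recipe note",
        explanation="The Meringue Cookies recipe has two different notes.",
    ),
    ComponentForm(name="note title"),  # a clone to be collapsed or renamed
]

for problem in step1_problems(forms):
    print(problem)
```

A check like this only finds mechanical gaps. Deciding whether two similarly named components are truly the same is still a judgment for the team.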

4.2.2. Step 2: Classifying Components

Step 2 involves continuing to analyze the components by sorting them into classes and subclasses.

The classes can be entirely of the team's design, reflecting whatever content-based and structural similarities the team members observe among the components. For example, a list-component class might represent variations on a theme from which authors might choose when marking up “enumerated” information. The modeling phase will superimpose a common set of superclasses on the components in order to help the modeling work proceed in an organized way, but you may find that the added classes bear a strong resemblance to some you've already constructed in this step.

Sometimes no natural home can be found for a component. However, we've found that attempting a thorough classification has several advantages:

  • It clarifies any remaining misunderstandings about the components and can bring to mind missing components and new places to apply existing ones.

  • It can eliminate unnecessary components. The grouping of seemingly similar components can reveal that they are actually identical, whereupon they can be collapsed.

  • It allows similar components to be treated similarly in the modeling work (as well as allowing the resulting markup to be treated similarly by processing applications where appropriate).

  • It begins the modeling process that will be conducted in earnest in the modeling phase. By the time you finish the classification work, you'll probably have strong suspicions about the nature of the markup many components will end up with (though you should wait until the actual modeling phase to determine the outcome for all components).

  • It will later help the DTD implementor to structure the DTD for easy maintenance and customization.

Following is one possibility for classification of the recipe and cookbook components identified in Section 4.2.1, “Step 1: Identifying Potential Components”. The components are indented to show their approximate hierarchical relationship, which begins to suggest how they might be structured when they are expressed as elements and attributes in the modeling phase. Some components appear in more than one class. The “Classes” fields in the component forms should be filled in with the class names shown here.

  Cookbook descriptive information
    Cookbook title
  Legal information
    Cookbook copyright
    Trademarked term attribution
  Acknowledgments section
  Recipe set
    Recipe set title
    Recipe set introduction
  Appendix title

  Recipe title
  Source of recipe
  Recipe type
  Recipe preparation time
  Recipe yield
  Recipe copyright
  Recipe introduction
  Ingredient list
  Instruction list

Ingredient list
  Ingredient sublist
    Ingredient sublist title
  Ingredient optionality
  Ingredient alternative
Instruction list
  Instruction step
  Instruction optionality
  Instruction title
  List item

  Note title



Unit of measurement
Recipe cross-reference
Oven setting
Trademarked term

Cookbook title
Recipe title
Recipe set title
Appendix title
Note title
Ingredient sublist title

Make sure to record your chosen classes in the component forms. If you invent any terminology in naming your classes (for example, “significant data”), define them in the project glossary.
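
The bookkeeping behind the “Classes” fields can be sketched in Python as follows. This is a minimal illustration, not a prescribed tool: the component names are taken from three of the groups listed above, but the class names used as dictionary keys are assumptions made for this sketch (only “significant data” appears in the text, and its assignment to this particular group is a guess). The function `classes_by_component` is likewise a hypothetical helper. It inverts a class-to-components mapping so that each component's list of classes can be copied onto its form, and it reports the components that belong to more than one class.

```python
from collections import defaultdict

# Keys are assumed class names; values are component names from the lists above.
classes = {
    "titles": [
        "Cookbook title", "Recipe title", "Recipe set title",
        "Appendix title", "Note title", "Ingredient sublist title",
    ],
    "significant data": [
        "Unit of measurement", "Recipe cross-reference",
        "Oven setting", "Trademarked term",
    ],
    "recipe structure": [
        "Recipe title", "Source of recipe", "Recipe type",
        "Recipe preparation time", "Recipe yield", "Recipe copyright",
        "Recipe introduction", "Ingredient list", "Instruction list",
    ],
}


def classes_by_component(classes):
    """Invert class -> components into component -> sorted class names."""
    result = defaultdict(list)
    for class_name, members in classes.items():
        for component in members:
            result[component].append(class_name)
    return {component: sorted(names) for component, names in result.items()}


by_component = classes_by_component(classes)

# Components in several classes deserve a second look during modeling.
multi = {c: names for c, names in by_component.items() if len(names) > 1}
for component, names in sorted(multi.items()):
    print(f"{component}: {', '.join(names)}")
```

With the sample data, only “Recipe title” is reported, because it is both a title and a part of the recipe structure.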

4.2.3. Step 3: Validating the Needs Against Similar Analyses

Step 3 offers an opportunity for the team to examine its list of potential components in light of any DTD work that has been done for similar documents, if such work exists.

The reason this comparison is done third, rather than first, is that looking at material developed by other people can bias the results of your own work. Once the team has begun to develop a working style and has charted a course, looking at similar DTDs can remind the team members of obvious components they missed and help them resist adding components that they honestly see no need for.

Why not just start with an existing DTD and change what needs to be changed, rather than going through steps 1 and 2? In some cases, discussed in Chapter 7, Design Under Special Constraints, the project may indeed have specified constraints and requirements that suggest customizing an existing DTD rather than starting from scratch. However, needs analysis is still required for that process. Also, the time and cost savings of starting with an existing DTD are often overrated because the goals of the designers of the original DTD don't match those of the variant-DTD designers.

Along with examining DTDs that already exist for the information domain of the project, it may be appropriate for the team to examine existing DTD fragments that might be usable whole, such as those for tables, mathematical equations, and electronic review of documents. The facilitator and DTD implementor can help with this examination. In the case of the CookThis project, the DTD implementor has suggested the CALS table fragment, and has spent some time describing its characteristics and how it compares to other table fragments supported by specific software vendors. Ultimately, the team felt comfortable accepting the CALS model.

One task you should pay particular attention to at this point is comparing the definitions of similarly named components, elements, and word processor styles. Often the same terminology is used for radically different things, and you may want to consider aligning your terminology with accepted industry practice, if there is sufficient precedent.

At the conclusion of step 3, the team members have a pile of forms describing all the potential markup distinctions that the model needs. For convenience in the modeling and specification steps discussed in Chapter 5, Document Type Modeling and Specification, the recordist should compile a list of all the component names, possibly with their existing markup equivalents next to them, arranged in a matrix for quick reference.
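
As a rough illustration of such a quick-reference matrix, the following Python sketch formats component names and their existing markup equivalents as two aligned text columns. The function name `reference_matrix` is hypothetical, and the markup descriptions are paraphrases of the rcard features mentioned in the component explanations, not actual rcard macro names.

```python
def reference_matrix(rows):
    """Format (component name, existing markup) pairs as aligned columns."""
    header = ("Component name", "Existing markup")
    # Sort alphabetically by component name for quick lookup.
    body = sorted(rows, key=lambda row: row[0].lower())
    table = [header] + body
    width = max(len(name) for name, _ in table)
    return "\n".join(f"{name.ljust(width)}  {markup}" for name, markup in table)


# Descriptions only; the real rcard feature names are not shown in this chapter.
rows = [
    ("Note", "rcard note feature"),
    ("Degrees", "rcard degree-symbol feature"),
    ("Number", "rcard fraction feature"),
]

matrix = reference_matrix(rows)
print(matrix)
```

A spreadsheet or a word processor table would serve equally well; the point is simply to have every component name and its current markup visible in one place.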

[7] Here we mean documentation of how a human should perform a function, rather than procedural markup that tells a computer how to format document data. A recipe instruction list is a kind of procedure.

[8] We'll discuss an alternative kind of content-based interpretation in Section 6.3, “Documents as Databases”.