Developing SGML DTDs


I cowrote a book called Developing SGML DTDs: From Text to Model to Markup with my dear friend and colleague Jeanne El Andaloussi. It describes a methodology for SGML DTD design and development that you may find useful for XML too, especially if you’re developing a DTD (rather than, say, an XML Schema) that is intended for the editing and publishing of largely narrative-form documents. The writing process stretched from early 1994 to mid-1995 and involved a lot of trips between Boston and Paris. The book hit the stores in December 1995 and finally went out of print in the summer of 2005.

If you’re interested in reading Developing SGML DTDs, you have two choices: find a printed copy somewhere or read it online. We made the online version available, with the help of the very talented Norm Walsh, on February 10th, 2008: the tenth anniversary of the publication of the XML V1.0 Recommendation. (XML V1.0 has seen several subsequent editions.)

Production Notes

The Preface has a colophon describing how the book was originally written and produced. It went to the publisher in DocBook V2.2.1 form and even included a filled-out DocBook Questionnaire. To prepare it for online publication in HTML form, Norm converted it to XML DocBook V4 as follows (in his own words):

I didn’t like the output from sgml2xml, so I did it with perl. Fix empty tags, remap character entities, convert entityref into fileref, turn the remaining entities into xincludes. Run through db4-upgrade. Fix a handful of things that didn’t work quite right because db2 wasn’t quite the same as db4. :-)

The only by-hand part was index terms. Most files had one or two places where the indexer wrote:

</section>
<indexterm .../>
</section>

That was valid in SGML DocBook because indexterms were an inclusion. In XML DocBook, content is forbidden after the close of a section, so it wasn’t. Whether DocBook’s content models should be relaxed so that it is legal is an interesting markup question.

I just moved the indexterms before the first </section>.
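
For readers who want to see the shape of that last fix, here is a minimal Python sketch of moving stranded index terms back inside the section that was just closed. It is not Norm's Perl and not necessarily how he did it (he describes this step as done by hand); the function name move_indexterms and the sample markup are hypothetical, and the sketch assumes each index term is written as a full <indexterm>...</indexterm> element.

```python
import re

# One or more <indexterm> elements stranded between two </section> end-tags.
# The tempered dot keeps each match inside a single indexterm element.
MISPLACED = re.compile(
    r"(</section>\s*)"                                           # inner section close
    r"((?:<indexterm\b(?:(?!</indexterm>).)*</indexterm>\s*)+)"  # stranded index terms
    r"(?=</section>)",                                           # outer section close
    re.DOTALL,
)


def move_indexterms(xml_text: str) -> str:
    """Move index terms that follow a </section> to just before it."""
    return MISPLACED.sub(lambda m: m.group(2) + m.group(1), xml_text)


# Hypothetical sample in the shape described above.
sample = (
    "<section><section><para>x</para>\n"
    "</section>\n"
    "<indexterm><primary>DTD</primary></indexterm>\n"
    "</section>"
)
print(move_indexterms(sample))
```

A regular expression is enough for a pattern this regular, but it is only a sketch; anything less predictable would be better handled with a real XML parser.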

Norm did some beautiful custom work for the online version, based on DocBook stylesheets he’d already developed. I’m deeply in his debt.

Print/Online Differences

Following are the differences between the print and online versions of the book that I’m aware of, beyond obvious differences in print vs. browser formatting. It should be noted that the online version was prepared from the SGML source files that we provided to Prentice Hall PTR, and that many text improvements were made during final copy editing and typesetting. I’ve found and “synced” a few, helped immensely by the fact that we authors were clever enough to use DocBook’s <comment> (now <remark>) element to mark typesetting issues and incomplete text! But there are many, many others lurking within. I’m going to try to sync them by hand over time.

  • Legal notices: The notices material is slightly different to account for the change in “publisher”. The notices at the bottom are also wrapped together rather than being separate paragraphs. Beware: The legal notices are a mix of old original ones (many no longer valid, like the Visio trademark belonging to “Shapeware”) and new ones (like the note about Prentice Hall).
  • ISO entities: Appendix D, which lists ISO character entity sets, contains a new column for Unicode-number equivalents thanks to the good graces of Norm.
  • Generated text: Internal cross-references, which in the printed version were generated to include only strings like Figure 1.4 (or page numbers in the index), in the online version include numbers as well as title/caption text: Figure 1.4, “Recipe Attributes”. Also, captions do not end with a (generated) period in the online version. Finally, the print and online TOCs have varying levels of detail.
  • Extra links: Cross-references that were invisible in printed form where they “silently” surrounded ordinary text have been rendered as hypertext links in the HTML conversion. This can be seen in Section 2.1.
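
To make the Unicode-number column in Appendix D concrete, here is a small Python sketch of looking up the code point for a named character. It is an illustration only: it uses the HTML5 named-character table in Python's standard library as a stand-in for the ISO entity sets, which is an assumption on my part and not how the online table was produced. The helper name unicode_equivalent is hypothetical.

```python
from html.entities import html5  # HTML5 names, used here as a stand-in for the ISO sets


def unicode_equivalent(entity_name: str) -> str:
    """Return 'U+XXXX' notation for a named character (keys in html5 end with ';')."""
    chars = html5[entity_name + ";"]
    # A few names expand to more than one code point, so join them all.
    return " ".join(f"U+{ord(ch):04X}" for ch in chars)


for name in ("napos", "eacute", "mdash"):
    print(f"{name:8} {unicode_equivalent(name)}")
```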

Errata

Jeanne and I want to take this opportunity to thank all those who commented on the book during review and after publication. In addition to those we mentioned in the Acknowledgments section, we’d especially like to mention Dave Peterson of SGMLWorks! and Diederik Gerth van Wijk of Kluwer Rechtswetenschappen, who provided many corrections and thoughtful comments after the book was out. We’re glad for this chance to advertise the corrections.

Following are errata for the printed (and online, unless otherwise noted) versions of the book. Errata exclusive to the online version will be corrected in place to the best of my ability; please do let me know of any issues you notice.

  • p. 186: Figure 6-4, “Modeling Choice with Single Element and Multiple Attribute Values”, had a bad line break. The online version fixes this.
  • p. 207: Section 7.1 has a typo in its penultimate paragraph. “If may even be best” should read “It may even be best”.
  • p. 215 and p. 218: Section 8.1.1 has an error in each of its first and last DTD examples; in the first, the second declaration of a “chapter” element type should be removed, and in the last, the element type “manual-title” should have the name “title”.
  • p. 239: Section 8.3.1 gives an example of how to prepare for extending a list of enumerated token values for an attribute. The example is correct, but it may be misleading in that it appears to contravene the advice in Section 9.3 about not nesting entities unnecessarily. The &newvals; entity has two potential roles: being empty (before extension) and containing one or more connector-value pairs. SGML prevents top-level entities from having the latter syntactic role, that is, from beginning with a connector.
  • p. 246: Section 8.5 refers to “naming elements”, more properly called “naming element types”.
  • p. 251: Section 8.6 has a DTD example that is, at a minimum, misleading. In the penultimate bullet item in the final list, it shows “note” as potentially allowing end-tags to be omitted. However, the purpose of the bullet item itself, and the instance example below, is to demonstrate that even applying a pattern of “- O” to all IUs may not result in many end-tag savings, since the instance demonstrates the need for a note end-tag in most locations given the scenario under discussion. Only a note that appears as the very last IU in a section could afford a missing end-tag.
  • p. 273: Section 9.4’s options for parameter entity choices may be misleading. Option 1 will need parentheses around the point of reference to be usable. Option 2’s and 3’s required natures may be subverted if their point of reference is surrounded by an optional group.
  • p. 297: Chapter 11’s introduction contains an unresolved cross-reference in the first full sentence of the bullet item “Validating and reviewing…”. It should refer to Section 11.2.
  • p. 308: Section 11.4.2 contains the awkward phrase “that it”, which makes a hash of the sentence. The sentence should read “…the DTD may still be difficult to use or may make the process of choosing markup too difficult.”
  • p. 352: Figure A-1, “Element Declaration Syntax”, would be more accurate if its labels that begin with element-name used element-type-name instead. Likewise, calling the figure “Element Type Declaration” would be more accurate.
  • p. 357: Section A.2 describes name-token-group incorrectly. Name tokens can’t contain non-NAME characters and can’t be quoted.
  • p. 358: Table A-1, “Attribute Declared Values”, has a misleading description of CDATA values. The “wrong” example will parse correctly but the strings that appear to be <emph> tags will not be recognized as markup. Further, the description of ENTITY values is misleading; such a value may or may not contribute to an element’s content, and in any case does not get parsed as such.
  • p. 365: Figure A-4, “General Entity Declaration Syntax”, is missing a period at the end of the caption. This period is generated text exclusive to the printed version.
  • p. 372: Figure A-7, “Marked Declaration Syntax”, is missing a word in the caption; it should read “Marked Section Declaration Syntax”.
  • p. 377: Figure A-11, “SGML Declaration Syntax”, has an error in its first line; it should read <!SGML "ISO 8879:year.
  • p. 388: Figure A-15, “SYNTAX Parameter Syntax”, has a typo. LCNNCHAR should be LCNMCHAR.
  • p. 389 ff.: Section A.9.4 describes numeric character references incorrectly in its description of BASESET. They should not be called “numeric character entity references”. In similar fashion, named character references should not be described as character entity references, as they are here. (These errors occur elsewhere in this appendix as well; BASESET is the first occurrence.)
  • p. 390: To my horror, Section A.9.4 offers a description of NAMECASE GENERAL and ENTITY that is precisely backwards. Rather, the settings of YES and NO are answers to the question “Should names be folded to uppercase?” The default for GENERAL is YES (do fold to uppercase) and for ENTITY is NO (don’t fold to uppercase; that is, remain case sensitive).
  • p. 475: Appendix D shows an incorrect character for napos. The apostrophe should just barely precede the n, touching it. (The online version shows a correct likeness.)
  • p. 492: Appendix E is missing a period after the Davenport Group bibliography entry.
  • p. 495: The Glossary’s definition of “comment” is incomplete when it describes comment declarations as appearing only interspersed throughout other markup declarations in a DTD; they can also appear within other markup declarations.
  • p. 495: The Glossary’s entry for “concrete syntax” is sorted incorrectly; it should be just before “content-based component”.
  • p. 502: The Glossary’s entry for “International Organization for Standardization” is missing a closing parenthesis in its cross-reference to the entry just below.
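
Since the NAMECASE correction (p. 390) is easy to get backwards, here is a minimal Python sketch of the corrected reading: YES means "fold names to uppercase", so GENERAL names compare case-insensitively by default while ENTITY names stay case sensitive. The names NAMECASE_DEFAULTS, normalize_name, and same_name are hypothetical, and str.upper() is a simplification of the case mapping that a real SGML declaration defines.

```python
# Defaults as stated in the corrected erratum: GENERAL YES, ENTITY NO.
# True means "YES, fold names to uppercase".
NAMECASE_DEFAULTS = {"GENERAL": True, "ENTITY": False}


def normalize_name(name: str, category: str, namecase=NAMECASE_DEFAULTS) -> str:
    """Fold the name to uppercase only if NAMECASE says YES for its category."""
    # Assumption: str.upper() stands in for the declared case mapping.
    return name.upper() if namecase[category] else name


def same_name(a: str, b: str, category: str, namecase=NAMECASE_DEFAULTS) -> bool:
    """Compare two names the way a parser with these settings would."""
    return normalize_name(a, category, namecase) == normalize_name(b, category, namecase)


print(same_name("chapter", "CHAPTER", "GENERAL"))  # element type names fold
print(same_name("newvals", "NewVals", "ENTITY"))   # entity names do not
```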