The HTML Writers Guild

Project Gutenberg

Marking up documents in XML

Here are some notes on using XML to mark up a document. The following pages are under heavy construction! Eventually there will be a number of pages that discuss marking up a document using the various book DTDs. This page, though, contains a few tips to get you started.

The process of marking up a document in XML follows certain well-defined steps. Although some parts of the process can be automated, for the most part it still relies heavily on manual work. It cannot be emphasised too strongly that the most important tool is a really good text editor. Without this you are sunk! (See tools of the trade.) Here are the steps that are normally followed in marking up a document.

Examining the e-text

The first thing to do is examine the e-text that you are going to mark up. Look for natural divisions and structures. Does it have front or back matter? Does the main body of the book have natural divisions such as parts, chapters, or stanzas? Obtaining a 'dead tree' copy of the book in question will often make this easier. If you have one, you can also scan in any illustrations that the book contains.

At this stage you may want to break down the e-text into several fragments and work on one fragment at a time.

Once you have a feel for the structure of the book you are in a position to choose a DTD.

Choosing a DTD

At present the selection of DTDs is rather limited (unless you want to use one of the commercial book DTDs available, but if you are skilled enough to do that you will probably not be reading this note anyway :>). Details of the DTDs can be found here.

There are accounts and descriptions available of the gutpoems1.dtd and the gutbook1.dtd.

Doing the initial markup

Most e-texts have a line break at the end of each line, and a double line break before and after a major structural division. For example, here is the end of one poem and the beginning of another.

As far surpassing other common villains
As Thou in natural parts has given me more.

Tarbolton Lasses, The

If ye gae up to yon hill-tap,
Ye'll there see bonie Peggy;
She kens her father is a laird,
And she forsooth's a leddy.

There Sophy tight, a lassie bright,
Besides a handsome fortune:
Wha canna win her in a night,
Has little art in courtin'.

If we were using the poemsfrag.dtd we would want to preserve all the lines, so we could use the 'Find and Replace' function of EditPad to replace \n (this is how you specify a new line in EditPad) with </line>\n<line>. This is what you will see.

As far surpassing other common villains</line>
<line>As Thou in natural parts has given me more.</line>
<line></line>
<line>Tarbolton Lasses, The</line>
<line></line>
<line>If ye gae up to yon hill-tap,</line>
<line>Ye'll there see bonie Peggy;</line>
<line>She kens her father is a laird,</line>
<line>And she forsooth's a leddy.</line>
<line></line>
<line>There Sophy tight, a lassie bright,</line>
<line>Besides a handsome fortune:</line>
<line>Wha canna win her in a night,</line>
<line>Has little art in courtin'.</line>
<line></line>
<line>
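For readers who would rather script this step than use EditPad, the same replacement can be sketched in Python. This is only an illustration under stated assumptions: the function name wrap_lines is hypothetical, and the fragment is assumed to end with a newline, as the e-text above does.

```python
def wrap_lines(text: str) -> str:
    # Mimic the editor's Find and Replace: every newline becomes
    # a closing </line>, a newline, and an opening <line>.
    return text.replace("\n", "</line>\n<line>")


# A shortened piece of the fragment shown above.
fragment = (
    "As Thou in natural parts has given me more.\n"
    "\n"
    "Tarbolton Lasses, The\n"
)

marked = wrap_lines(fragment)
print(marked)
```

As in the editor, the first line has no opening tag and a dangling <line> is left at the very end, so the result still needs the small manual fix-up.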

Move the dangling <line> from the end to the very top, so that the first line gets its opening tag, and again use the Find and Replace function to replace <line></line> with </verse>\n\n<verse>. This is what we will get.

<line>As far surpassing other common villains</line>
<line>As Thou in natural parts has given me more.</line>
</verse>

<verse>
<line>Tarbolton Lasses, The</line>
</verse>

<verse>
<line>If ye gae up to yon hill-tap,</line>
<line>Ye'll there see bonie Peggy;</line>
<line>She kens her father is a laird,</line>
<line>And she forsooth's a leddy.</line>
</verse>

<verse>
<line>There Sophy tight, a lassie bright,</line>
<line>Besides a handsome fortune:</line>
<line>Wha canna win her in a night,</line>
<line>Has little art in courtin'.</line>
</verse>

<verse>

Again move the last <verse> to the top.
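Both replacements, together with the two "move the last tag to the top" fix-ups, can be sketched as one small Python script. This is a minimal sketch, not the method used to prepare the texts: the function names move_dangling_tag and lines_to_verses are hypothetical, and the input is assumed to end with a blank line, as the fragment above does.

```python
def move_dangling_tag(text: str, tag: str) -> str:
    # If the replace left an opening tag at the very end, move it to the top.
    if text.endswith(tag):
        return tag + text[: -len(tag)]
    return text


def lines_to_verses(text: str) -> str:
    # Step 1: wrap every line, then move the dangling <line> to the top.
    text = move_dangling_tag(text.replace("\n", "</line>\n<line>"), "<line>")
    # Step 2: an empty <line></line> came from a blank line, i.e. a verse break.
    text = text.replace("<line></line>", "</verse>\n\n<verse>")
    # Step 3: move the dangling <verse> (with its newline) to the top.
    return move_dangling_tag(text, "<verse>\n")


fragment = (
    "As far surpassing other common villains\n"
    "As Thou in natural parts has given me more.\n"
    "\n"
    "Tarbolton Lasses, The\n"
    "\n"
)

print(lines_to_verses(fragment))
```

If the input does not end with a blank line, the last verse is left without its closing tag, so it is worth checking the end of the output by eye.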

Now we have all our basic markup. Although we have only been using a small fragment here, the amount of effort is the same whether we are marking up two poems or a hundred!

In poetry, of course, the individual lines are important; in a book they are not. There we would probably just look for a double line break, \n\n, and replace it with \n</para>\n\n<para>\n, which would divide our initial text up into <para> elements.
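The paragraph version of the replace can be sketched in Python in the same way. The function name paragraphs_to_xml is hypothetical, and the sketch assumes that paragraphs are separated by exactly one blank line. It also adds the outermost opening and closing tags, which the replace on its own does not produce.

```python
def paragraphs_to_xml(text: str) -> str:
    # Drop leading and trailing newlines so no empty <para> is created.
    body = text.strip("\n")
    # The replace described above: a double line break separates paragraphs.
    body = body.replace("\n\n", "\n</para>\n\n<para>\n")
    # Add the first opening tag and the last closing tag.
    return "<para>\n" + body + "\n</para>\n"


sample = "First paragraph,\nsecond line.\n\nSecond paragraph.\n"
print(paragraphs_to_xml(sample))
```

Line breaks inside a paragraph are left alone, since in a book they carry no meaning.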

Although we can automate this part of the process, the next part, adding in the other elements, can be labor intensive!

Refining the markup

This is where marking up the document becomes a labor of love! Although there are a few short-cuts we can take, unfortunately one really has to go through the whole document and put in the detailed element structure. However, here is a tip that can save countless hours. Consider the markup below:

<verse>
<line>Tarbolton Lasses, The</line>
</verse>

This needs to be changed to:

<title>Tarbolton Lasses, The</title>

This is easy enough to do, but it is time-consuming if one has to do it for a hundred or more instances! A much easier way is to go through the whole text and change each such instance to:

<averse>
<line>Tarbolton Lasses, The</line>
</averse>

Now use the 'Find and Replace' function to convert <averse>\n<line> to <title>, and </line>\n</averse> to </title>. You will find this saves a lot of time!
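The two replacements in this tip can also be sketched in Python. The function name averse_to_title is hypothetical, and the sketch assumes each marked title occupies a single <line> inside its <averse> block, as in the example above.

```python
def averse_to_title(text: str) -> str:
    # Opening pair: <averse>, newline, <line> collapses to <title>.
    text = text.replace("<averse>\n<line>", "<title>")
    # Closing pair: </line>, newline, </averse> collapses to </title>.
    return text.replace("</line>\n</averse>", "</title>")


marked = (
    "<averse>\n"
    "<line>Tarbolton Lasses, The</line>\n"
    "</averse>\n"
    "\n"
    "<verse>\n"
    "<line>If ye gae up to yon hill-tap,</line>\n"
    "</verse>\n"
)

print(averse_to_title(marked))
```

Ordinary <verse> blocks are untouched, because neither search string matches them. The temporary averse name only needs to be something that does not otherwise occur in the text.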

Using scripts

If you are familiar with scripting languages such as Perl or JavaScript, it is possible to write scripts to automate some stages of the process, but unfortunately scripts can rarely be used to refine the markup! If you are a scripting whiz, though, you may want to develop a series of scripts. Actually, although I have such a series of scripts, I have rarely used them since I discovered EditPad, because its ability to replace new lines makes it quicker for me to use that feature than to fire up my scripting engine!

Finding DTDs

There are accounts and descriptions available of the gutpoems1.dtd and the gutbook1.dtd.

This page is maintained by frank@hwg.org. Last updated on 7 February 2000.
Copyright © 2000 by the HTML Writers Guild, Inc.