[Humanist] 28.434 HTML vs XML for TEI -- and TEI Simple

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Fri Oct 24 07:09:49 CEST 2014


                 Humanist Discussion Group, Vol. 28, No. 434.
            Department of Digital Humanities, King's College London
                       www.digitalhumanities.org/humanist
                Submit to: humanist at lists.digitalhumanities.org

  [1]   From:    Desmond Schmidt <desmond.allan.schmidt at gmail.com>        (354)
        Subject: Re:  28.427 HTML vs XML for TEI -- and TEI Simple

  [2]   From:    Desmond Schmidt <desmond.allan.schmidt at gmail.com>        (303)
        Subject: Re:  28.427 HTML vs XML for TEI -- and TEI Simple


--[1]------------------------------------------------------------------------
        Date: Thu, 23 Oct 2014 16:47:26 +1000
        From: Desmond Schmidt <desmond.allan.schmidt at gmail.com>
        Subject: Re:  28.427 HTML vs XML for TEI -- and TEI Simple
        In-Reply-To: <20141023051908.587546CA7 at digitalhumanities.org>


Hi Martin,

you're ignoring the RDFa part of my proposal, which bears the semantic
information. I wasn't proposing eliminating anything useful from the TEI
scheme, just expressing it in an abstract way for use in more modern and
future technologies.

In your example you reduce to the absurd the variability of legitimate but
unlikely encodings in HTML for poetic lines. In TEI one can play even
wilder games with the same material, because there are many more tags with
almost the same meaning, plus looser attribute definitions, to play with:

<l>I wandered lonely as a cloud,</l>...
<div type="stanza">I wandered lonely as a cloud,<lb/>...
<div type="line">I wandered lonely as a cloud,</div>
<ab type="line">I wandered lonely as a cloud,</ab>
<seg type="line">I wandered lonely as a cloud,</seg>
etc.

In TEI an <l> element may contain any one of 196 different types of other
TEI elements, and may itself be contained by 53 different types of
elements. I don't see how that is highly constrained as claimed.
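
A minimal sketch of the point, assuming Python's standard
xml.etree.ElementTree and a made-up, merely well-formed fragment (not a
valid TEI document): a query for <l> retrieves only one of the five
encodings listed above. The names SAMPLE and line_recall are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the same verse line encoded five different ways.
# It is well-formed XML for illustration only, not a valid TEI document.
SAMPLE = """<text>
  <l>I wandered lonely as a cloud,</l>
  <div type="stanza">I wandered lonely as a cloud,<lb/></div>
  <div type="line">I wandered lonely as a cloud,</div>
  <ab type="line">I wandered lonely as a cloud,</ab>
  <seg type="line">I wandered lonely as a cloud,</seg>
</text>"""


def line_recall(xml_text, query=".//l"):
    """Return (lines found by the query, lines actually present)."""
    root = ET.fromstring(xml_text)
    found = len(root.findall(query))
    # In this sample every direct child of <text> encodes one verse line.
    total = len(list(root))
    return found, total


found, total = line_recall(SAMPLE)
print(f"{found} of {total} lines retrieved by the <l> query")
```

Broadening the query to cover the other tags restores recall only for the
variants one has thought of in advance.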

> I think you must be misunderstanding the purpose of surface/zone markup.
> The idea here is to be able to link areas on images (typically page-images
> in a facsimile) to other aspects of markup; so, for example, one might
> define a zone outlining a stanza in a poem, and link that to a
> transcription of the poem encoded using <lg> and <l>. There are no
> implications for rendering whatsoever.
>

> As I said before, we may use that information in the process of rendering
> an online facsimile edition (for example); but all it's actually saying is:
> here is a shape on the page-image, with an @xml:id.
>

In that case I suggest that you rename the TEI Guidelines the TIEI (Text
and Image Encoding) Guidelines, since they now contain markup for images.
You have to draw the line somewhere, and the element in question does not
describe text. Such information should be external to the textual
surrogate, not part of it.

Hi Ken,

no, I hadn't seen that one, but this one also springs to mind:
http://www.google.com/trends/explore#q=xml,json
Perhaps people don't realise how many billions of queries these graphs are
based on. The decline in XML's popularity is very real.

Hi Martin,

I'm sorry that I don't share your enthusiasm for TEI Simple, as it is
described. I can only ask what went wrong with TEI Lite, TEI Tite, the
DTA-basis format and the TextGrid baseline encoding that TEI Simple is going
to fix? Could I perhaps interest you instead in basing TEI Simple on
*abstract* properties of text rather than a fixed XML syntax? Imposing a
strict syntax, even at a coarse-grained level, will I fear not work, because
everyone interprets the same codes differently, however simple they are.
Any attempt to retrieve information from deeply encoded documents which
have been marked up by humans - exactly as you point out in your quote -
will have very poor recall, although its precision will be high. If your
textual enquiry is roughly "find all the quotes in all the lines of all the
stanzas of all the poems by Joe Bloggs", a percentage of the elements
retrieved will be lost at each level of the hierarchy due to variations in
the way those elements are encoded, until you may find no such quotes at
all, even though hundreds of them may exist. This is what the DTA already
complained about (Geyken et al. 2012). You should also reconsider what
Patrick Durusau said in Electronic Textual Editing about the loss of
information that variation in the encoding of even a *single* tag leads to.
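
The compounding loss can be illustrated with simple arithmetic. The sketch
below assumes, purely for illustration, that each level of the hierarchy is
encoded in the expected way at some given rate, and that the levels are
independent; neither the rates nor the independence assumption come from
Geyken et al., and compound_recall is a hypothetical helper.

```python
import math


def compound_recall(per_level_rates):
    """Fraction of matches that survive every level of a nested query.

    per_level_rates: for each level (poem, stanza, line, quote), the assumed
    share of instances encoded the way the query expects. Treating the
    levels as independent is an illustrative simplification.
    """
    return math.prod(per_level_rates)


# Illustrative figures only: the same assumed rate at all four levels.
for rate in (0.9, 0.7, 0.5):
    recall = compound_recall([rate] * 4)
    print(f"per-level rate {rate:.0%} -> overall recall {recall:.1%}")
```

Even a 90% rate per level leaves only about two thirds of the quotes
retrievable after four levels, and a single level encoded in an unexpected
way throughout (rate zero) loses them all.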

The expressed goal of TEI Tite was to specify *"exactly one* way of
encoding a particular feature of a document in as many cases as possible,
ensuring that any two encoders would produce the same XML document for a
source document." If it succeeded in that regard, I don't understand the
need for TEI Simple.

Desmond Schmidt
Queensland University of Technology



--[2]------------------------------------------------------------------------
        Date: Thu, 23 Oct 2014 20:07:26 +1000
        From: Desmond Schmidt <desmond.allan.schmidt at gmail.com>
        Subject: Re:  28.427 HTML vs XML for TEI -- and TEI Simple
        In-Reply-To: <20141023051908.587546CA7 at digitalhumanities.org>


I would like to add this observation:

Martin Holmes said:

> There are no implications for rendering whatsoever.

> As I said before, we may use that information in the process of rendering
> an online facsimile edition
>

I find it impossible to reconcile these two statements.

If in <zone> there are "no implications for rendering whatsoever", how can
you then use, even sometimes, "that information in the process of
rendering"? And when not used for rendering, what is its purpose? Surely
only to be ignored.

Desmond Schmidt
Queensland University of Technology


