[Humanist] 28.419 HTML vs XML for TEI

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Mon Oct 20 08:05:20 CEST 2014

                 Humanist Discussion Group, Vol. 28, No. 419.
            Department of Digital Humanities, King's College London
                Submit to: humanist at lists.digitalhumanities.org

  [1]   From:    Martin Holmes <mholmes at uvic.ca>                          (116)
        Subject: Re: [Humanist] 28.414 HTML vs XML for TEI

  [2]   From:    Hugh Cayless <philomousos at gmail.com>                     (104)
        Subject: Re:  28.414 HTML vs XML for TEI

        Date: Sun, 19 Oct 2014 05:56:50 -0700
        From: Martin Holmes <mholmes at uvic.ca>
        Subject: Re: [Humanist] 28.414 HTML vs XML for TEI
        In-Reply-To: <20141019064359.8F581612A at digitalhumanities.org>

Hi Desmond,

On 14-10-18 11:43 PM, Humanist Discussion Group wrote:
> Dear Hugh and Martin,
> Please forgive me for attempting to pick apart your arguments, but I
> think that readers of Humanist have a right to have the facts unravelled,
> and judge for themselves.
> 1. HTML is presentation-oriented whereas XML describes logical structure.
> Direct formatting of elements in modern HTML has long been deprecated.
> CSS formatting cleanly separates rendition from textual structure in
> ways that XML cannot match. Of course, XML is *supposed* to store the
> clean logical structure of the source documents, but as I have been
> informed on numerous occasions when reviewing other people's TEI-XML,
> embedding end-result related information directly into the TEI source is
> now common and accepted practice. Just take a look at the examples
> in the Guidelines for <surface> and <zone> some time.

I think it's important to distinguish between detailed description of 
what a source text looks like (for which we use CSS all the time, 
because it's a great tool for the job), and information intended for the 
rendering of a text. The current TEI mechanisms for describing the
typographical features, layout, etc. of a source text are not meant
(and should not be used) for embedding output or processing
information. If you know of any such examples in the Guidelines, please
report them as a bug.

That's not to say, of course, that we may not use embedded information 
describing the source text to render it for a reader in a similar way; 
but that's not its purpose.

> HTML *started out* as an explicit presentation format, but it has become
> the lingua franca of mixed content on the Web, and has been extended
> with powerful mechanisms for expressing trillions of documents of all
> kinds. I find it hard to believe that it could not also express the
> documents that digital humanists seek to record. Are we so different?

No. But HTML5 (for example) has (I think) fewer than 120 elements, 
whereas TEI has nearly 600, and we deal with a steady flow of requests 
for more. TEI has much more descriptive power than HTML. And I would 
claim that it's much easier to use this descriptive power in TEI XML 
than it would be to hack it into HTML; to do that, you would have to 
decide whether (for instance) a <nationality> element should be 
expressed as a <span> or a <div> or a <p> in HTML, and what its 
distinguishing attribute should be; and in order to make the results 
truly interoperable, we would all have to agree on those things, and 
document them, and create a new Guidelines document to formalize them. 
The result would be more verbose and less human-readable.
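
To make the point about verbosity concrete, here is a minimal sketch in
Python of one possible mapping. It assumes a purely hypothetical
convention (every TEI element becomes a <span>, its name goes into a
"data-tei" attribute, and its attributes become "data-*" attributes).
Nothing like this is defined by the TEI Guidelines or by HTML; it is
exactly the kind of agreement that would first have to be reached and
documented.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def tei_to_html_span(elem):
    """Map one TEI element to an HTML <span> (hypothetical convention)."""
    local_name = elem.tag.split("}")[-1]  # drop the "{namespace}" part
    span = ET.Element("span", {"data-tei": local_name})
    # Carry TEI attributes over as data-* attributes (also an assumption).
    for key, value in elem.attrib.items():
        span.set("data-" + key.split("}")[-1], value)
    span.text = elem.text
    return span


# Illustrative input only; the key value is made up for this example.
tei_source = '<nationality xmlns="%s" key="CA">Canadian</nationality>' % TEI_NS
tei_elem = ET.fromstring(tei_source)
html_elem = tei_to_html_span(tei_elem)
html_output = ET.tostring(html_elem, encoding="unicode")
print(html_output)
```

The HTML form needs a generic element plus a distinguishing attribute to
say what the TEI element name says on its own, and a different project
could just as reasonably have chosen <div> or a class attribute instead.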

> 2. Using HTML instead of XML would solve no problems
> Indeed it does solve at least one big problem: it would make TEI-encoded
> texts interoperable across thousands of applications that already
> understand HTML. At the moment, given the immense variation in the
> selection and application of TEI tags, not only is interoperability
> impossible, but even interchange (that is, lossy conversion in order to
> re-use a document) is difficult without prior agreement. How are we
> supposed to work together when the language through which we communicate
> actually impedes collaboration?

I don't think it would, for the reasons I mention above. It might make 
it more immediately renderable in web browsers; but a simple CSS 
stylesheet can do that for TEI anyway.

> 3. HTML is less stable than TEI-XML
> HTML is in its fifth standardised definition in 22 years. In that time TEI
> has also had five major revisions, but numerous incremental changes. P5
> has gone through *23* revisions since 2007. Furthermore, HTML is strictly
> standardised by the ISO and W3C. Software vendors are at a disadvantage
> if they attempt to deviate from the standard. On the other hand, users of
> TEI are positively encouraged to customise and extend TEI to suit their
> needs.

Most users, when customizing, don't _extend_ TEI; they actually 
constrain it further. I don't have a single project in which I've 
introduced anything new into my TEI schema (and I have a lot of TEI 
projects); all my TEI files should validate against tei_all (the 
comprehensive TEI schema). We also make every effort to avoid breaking 
backwards compatibility in any updates to TEI; we have very rarely done 
it, and in all the cases I can think of since I've been on the TEI 
Council, only after determining that the change will not materially 
affect more than a handful of existing users (or no-one at all).

> 4. XML is a better archiving format than HTML
> This follows naturally from point 3: the more stable a format the better
> it is for archiving.

It's very good practice to generate as many different output formats as 
you can for reasons of archiving, survivability and 
interoperability/interchange. TEI is a good one, because it comes with 
built-in detailed documentation (if you've done a good job with your ODD 
file). But HTML is another good one, for different purposes. It's easier 
to produce HTML from TEI than the reverse.

> 5. TEI is easier to type than HTML
> The exact format of the TEI part is not yet decided, so examples that
> compare verbosity are not feasible.
> Also, many people understand HTML already, but have to be trained to
> use TEI-XML. In my experience of supervising such work, encoders make
> many mistakes that take years to unlearn, and have to be constantly
> corrected. As a result, keeping the text consistent, even within a single
> project, is extremely difficult.

In all my years of teaching people HTML, TEI and other languages, I've 
never come across a single person younger than me who has had any 
problem learning it; nowadays, given all the extra aids we have from 
tools such as Oxygen, it takes a remarkably short time to get productive 
with TEI. I've come across a few older people who claim they find it 
hard, but generally before they've actually made any effort to learn it.

> As Hugh admits:
>> It may now, given the current state of the technology, be possible to
>> sensibly express TEI in HTML
> Indeed. In that case I ask again, why don't we do it, and all talk to
> each other in the language of the Web?

Because we'd have to rewrite the TEI Guidelines in HTML in order to 
ensure we're all using the same expressions in HTML for each of the 
things that are already well-described in TEI; and the result would be 
harder to maintain and transform into other things.


        Date: Sun, 19 Oct 2014 14:03:43 -0400
        From: Hugh Cayless <philomousos at gmail.com>
        Subject: Re:  28.414 HTML vs XML for TEI
        In-Reply-To: <20141019064359.8F581612A at digitalhumanities.org>

Dear Desmond,

Your "unravelling" seems to me more like erecting strawmen that vaguely resemble Martin’s and my responses, which you can then easily knock over. This is not arguing in good faith. So while I’m happy and interested to discuss these things, I’m not prepared to do so under those conditions. If you want to have a serious discussion, I’m up for it, but if not, this will probably be my last word on the subject.

1. CSS, wonderful as it is, does not transform HTML automatically into a semantic markup language. As proof, I give you HTML’s six (!) levels of header, its <p> tag that isn’t actually a paragraph, and its ordered vs. unordered (!) lists. I will buy you a beer if you can explain to me how to make a list that isn’t ordered. HTML has taken some steps in a semantic direction, certainly, but its basic nature hasn’t changed.

That said, I do think an HTML flavor of TEI is possible, and in fact steps towards this are being taken. It is not a simple thing, however. And I suspect it won’t look very much like typical HTML.

2. What do you mean by "interoperable"? I’ll simply quote my previous point, as you didn’t address it:

>> "Interchange" and "interoperable" are superficially simple concepts, but
>> the reality is very different. Interchange might mean many different things
>> in different contexts. Adhering to common standards such as TEI and XML
>> makes interchange *possible*, but nothing is going to make it
>> plug-and-play.

I will add that texts are complex things, and what you want to do with yours might be quite different from what I want to do with mine. What does it mean for a Shakespeare play and a documentary papyrus to be "interoperable"? I have no idea.

3. Standards are lovely. Have you ever *seen* HTML in the wild? It’s tag soup. As I said previously, HTML has very few constraints. Yes, you can validate it, but hardly anyone does. What passes for validation is, "Does it look right in a browser? Yes? Ship it!" TEI XML has many more constraints than HTML, and I will repeat my point that this makes it easier to work with during its creation, for anyone wanting to re-use it, and from the perspective of digital preservation. In order to get the same affordances in HTML, we’d have to do a *huge* amount of work, and it might not be possible at all, because progress in HTML, for all that it’s a standard, is driven by big companies, not us poor digital humanists.
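
The difference in constraints can be sketched with Python's standard library alone. This is only an illustration under stated assumptions: `html.parser` is a lenient tokenizer rather than a browser-grade HTML5 parser, and XML well-formedness is a much weaker check than validation against a TEI schema. The class and function names are invented for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records every start tag the lenient HTML parser reports."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def is_well_formed_xml(markup):
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# "Tag soup": unclosed <p> and <b>, and a </div> that closes nothing cleanly.
soup = "<div><p>First paragraph<p>Second <b>bold</div>"

collector = TagCollector()
collector.feed(soup)  # no exception: missing end tags are tolerated
collector.close()

print(collector.start_tags)
print(is_well_formed_xml(soup))
```

The HTML side accepts the markup without complaint, while the XML parser rejects it outright, so errors of this kind surface while the document is being created rather than when someone later tries to re-use it.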

4. Sorry, does not follow, because your point #3 does not stand up to scrutiny.

5. I’m not really swayed that much by arguments about verbosity, because there are usually technological solutions to them, for example editors with autosuggest features or domain-specific markup languages with a minimal and restricted syntax which can be expanded to the standard serialization. But: the vast majority of people who create HTML content are doing so in a WYSIWYG editor or with Markdown vel sim. They don’t actually know HTML. What makes you think TEI-equivalent HTML would be any easier for them to learn than TEI XML? Martin is quite right that as of now, editing TEI XML is much easier than it would be to edit HTML with equivalent semantics. If your TEI encoders are making certain mistakes consistently, it’s very easy to write Schematron rules that will flag those and make the document invalid until they are corrected. This kind of workflow isn’t hard to manage in TEI, which is one reason people use it.
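
As a rough illustration of the kind of check such a rule performs, here is a Python stand-in. It is not Schematron itself, which is an XML rule language normally run from an XML editor or an XSLT pipeline. The rule chosen (every <persName> must carry a @ref) is a hypothetical project convention, and the function name and sample text are invented for the example.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"


def check_persname_refs(tei_xml):
    """Return one message per <persName> that lacks a @ref attribute."""
    root = ET.fromstring(tei_xml)
    problems = []
    for pers in root.iter(TEI + "persName"):
        if not pers.get("ref"):
            name = "".join(pers.itertext()).strip()
            problems.append("persName '%s' has no @ref" % name)
    return problems


# Made-up sample: the second name is missing its @ref.
sample = """<p xmlns="http://www.tei-c.org/ns/1.0">
  <persName ref="#lovelace">Ada Lovelace</persName> wrote to
  <persName>Babbage</persName>.
</p>"""

problems = check_persname_refs(sample)
for message in problems:
    print(message)
```

A real Schematron rule would express the same assertion declaratively and report it inside the editor as the encoder types, which is what makes consistent mistakes cheap to catch.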

In short, XML has lots of advantages that currently make it better suited to the creation and publication of TEI texts than HTML. That might change in the future. TEI will, if it survives, not always be expressed only in XML. But you’re arguing as if a switch to HTML would be easy, and it simply ain’t so. 

All the best,
