[Humanist] 28.414 HTML vs XML for TEI

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Sun Oct 19 08:43:55 CEST 2014


                 Humanist Discussion Group, Vol. 28, No. 414.
            Department of Digital Humanities, King's College London
                       www.digitalhumanities.org/humanist
                Submit to: humanist at lists.digitalhumanities.org



        Date: Sat, 18 Oct 2014 18:52:34 +1000
        From: Desmond Schmidt <desmond.allan.schmidt at gmail.com>
        Subject: Re:  28.409 PostGreSQL and Solr for digital archives
        In-Reply-To: <20141018053716.5026A5FB7 at digitalhumanities.org>


Dear Hugh and Martin,

Please forgive me for attempting to pick apart your arguments, but I
think that readers of Humanist have a right to have the facts unravelled,
and to judge for themselves.

1. HTML is presentation-oriented whereas XML describes logical structure.

Direct formatting of elements in modern HTML has long been deprecated.
CSS formatting cleanly separates rendition from textual structure in
ways that XML cannot match. Of course, XML is *supposed* to store the
clean logical structure of the source documents, but as I have been
informed on numerous occasions when reviewing other people's TEI-XML,
embedding end-result related information directly into the TEI source is
now common and accepted practice. Just take a look at the examples
in the Guidelines for <surface> and <zone> some time.

HTML *started out* as an explicit presentation format, but it has become
the lingua franca of mixed content on the Web, and has been extended
with powerful mechanisms for expressing trillions of documents of all
kinds. I find it hard to believe that it could not also express the
documents that digital humanists seek to record. Are we so different?

2. Using HTML instead of XML would solve no problems

On the contrary, it would solve at least one big problem: it would make
TEI-encoded texts interoperable across the thousands of applications
that already understand HTML. At the moment, given the immense variation
in the selection and application of TEI tags, not only is
interoperability impossible, but even interchange (that is, lossy
conversion in order to re-use a document) is difficult without prior
agreement. How are we supposed to work together when the language
through which we communicate actually impedes collaboration?

3. HTML is less stable than TEI-XML

HTML is in its fifth standardised definition in 22 years. In that time
TEI has also had five major revisions, as well as numerous incremental
changes: P5 alone has gone through *23* revisions since 2007.
Furthermore, HTML is strictly standardised by the ISO and W3C. Software
vendors are at a disadvantage if they attempt to deviate from the
standard. On the other hand, users of TEI are positively encouraged to
customise and extend TEI to suit their needs.

4. XML is a better archiving format than HTML

This is settled by point 3: the more stable a format, the better it is
for archiving, and on the evidence above HTML is the more stable of the
two.

5. TEI is easier to type than HTML

The exact format of the TEI part is not yet decided, so examples that
compare verbosity are not feasible.

Also, many people understand HTML already, but have to be trained to
use TEI-XML. In my experience of supervising such work, encoders make
many mistakes that take years to unlearn, and have to be constantly
corrected. As a result, keeping the text consistent, even within a single
project, is extremely difficult.

As Hugh admits:

> It may now, given the current state of the technology, be possible to
> sensibly express TEI in HTML

Indeed. In that case I ask again, why don't we do it, and all talk to
each other in the language of the Web?
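
As a rough illustration of what "expressing TEI in HTML" could mean in
practice, here is a minimal sketch in Python. It assumes the "tei_" class
prefix used in Martin's example (quoted below); the function name
tei_to_html is hypothetical, and the sketch deliberately ignores TEI
attributes, which a real conversion would also have to carry across
(for instance as data-* attributes).

```python
import xml.etree.ElementTree as ET


def tei_to_html(elem: ET.Element, prefix: str = "tei_") -> ET.Element:
    """Map a TEI element (and its descendants) to nested HTML <span>s.

    The TEI element name is kept as a class attribute. Attributes on the
    TEI element are dropped in this sketch (an assumption made for brevity).
    """
    # Strip an XML namespace such as {http://www.tei-c.org/ns/1.0} if present.
    local_name = elem.tag.split("}")[-1]
    span = ET.Element("span", {"class": prefix + local_name})
    # Preserve mixed content: text before the first child and after the element.
    span.text = elem.text
    span.tail = elem.tail
    for child in elem:
        span.append(tei_to_html(child, prefix))
    return span


tei = "<persName><forename>Fred</forename> <surname>Bloggs</surname></persName>"
html = ET.tostring(tei_to_html(ET.fromstring(tei)), encoding="unicode")
print(html)
```

Run on the persName example, this prints the same nested spans that
Martin gives. The mapping is purely mechanical, so the tag names survive
as class values; whether the result is pleasant to type by hand is a
separate question.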

Desmond Schmidt
Queensland University of Technology

On Sat, Oct 18, 2014 at 3:37 PM, Humanist Discussion Group <
willard.mccarty at mccarty.org.uk> wrote:

>                  Humanist Discussion Group, Vol. 28, No. 409.
>             Department of Digital Humanities, King's College London
>                        www.digitalhumanities.org/humanist
>                 Submit to: humanist at lists.digitalhumanities.org
>
>   [1]   From:    Hugh Cayless <philomousos at gmail.com>               (84)
>         Subject: Re:  28.404 PostGreSQL and Solr for digital archives
>
>   [2]   From:    Martin Holmes <mholmes at uvic.ca>                     (11)
>         Subject: RE:  28.404 PostGreSQL and Solr for digital archives
>
>
>
> --[1]------------------------------------------------------------------------
>         Date: Fri, 17 Oct 2014 09:52:31 -0400
>         From: Hugh Cayless <philomousos at gmail.com>
>         Subject: Re:  28.404 PostGreSQL and Solr for digital archives
>         In-Reply-To: <20141017064531.B75ED622A at digitalhumanities.org>
>
>
> Dear Desmond,
>
> I am one of those who believe that TEI should eventually move toward
> defining an abstract model, expressible in a variety of serializations (XML
> being one of those). I don't think that view is particularly heretical
> among users of the TEI.
>
> That being said, however, a move like the one you suggest isn't feasible
> for a variety of reasons. HTML is primarily a language for visually
> formatting text+other media. TEI is primarily for encoding the semantics of
> text+other media. This means there are a number of mismatches between the
> TEI Way and the HTML Way which make such a 1:1 conversion very
> difficult. It may now, given the current state of the technology, be
> possible to sensibly express TEI in HTML but that doesn't solve many
> problems by itself. Work is underway to define a "Simple" expression of TEI
> that has both a data model and a processing model (e.g. TEI elements will
> have formatting conventions) and this will, I hope, be a stepping stone
> towards the goal I mentioned in the first paragraph, but there's a lot of
> work to be done yet.
>
> Furthermore, because of its nature, HTML is a moving target in ways that
> TEI isn't. Having had the experience of migrating old (ca. 10 years or
> more) TEI SGML/XML collections and old HTML collections, I can tell you the
> TEI is *vastly* easier to deal with. It makes for a much better archival
> format. Constraints are good from this perspective, and HTML has very few
> constraints.
>
> "Interchange" and "interoperable" are superficially simple concepts, but
> the reality is very different. Interchange might mean many different things
> in different contexts. Adhering to common standards such as TEI and XML
> makes interchange *possible*, but nothing is going to make it
> plug-and-play.
>
> Lastly, I don't really see a problem with a publishing workflow that has at
> its core files most users won't access. The TEI files aren't themselves the
> deliverable to users, it's the viewing and discovery interfaces that they
> support which most users will want. To have that and a format you can build
> a sensible editorial workflow around *and* a decent archival format that
> preserves a great deal of the interpretive work that went into the files'
> creation seems like a huge win to me.
>
> All the best,
> Hugh
>
> On Fri, Oct 17, 2014 at 2:45 AM, Humanist Discussion Group <
> willard.mccarty at mccarty.org.uk> wrote:
>
> >                  Humanist Discussion Group, Vol. 28, No. 404.
> >             Department of Digital Humanities, King's College London
> >                        www.digitalhumanities.org/humanist
> >                 Submit to: humanist at lists.digitalhumanities.org
> >
> >
> >
> >         Date: Thu, 16 Oct 2014 21:08:43 +1000
> >         From: Desmond Schmidt <desmond.allan.schmidt at gmail.com>
> >         Subject: Re:  28.400 PostGreSQL and Solr for digital archives
> >         In-Reply-To: <20141016064745.23806622A at digitalhumanities.org>
> >
> >
> > Hi Martin,
> >
> > I'd like to expand the discussion a bit, but my point of departure is
> > your remark that: "little if anything of the TEI encoding is actually
> > available to the user". The technical reason for this is, of course,
> > that these applications do not intrinsically support XML, although they
> > can import it. But the underlying reason is that we encoded the XML
> > through the exercise of human judgement and interpretation. It should
> > then come as no surprise that some of that information gets lost when it
> > is read by a machine.
> >
> > What I would like to suggest as a remedy to this situation is that we
> > stop trying to share our data on the *basis* of human-determined tags.
> > Instead we could use HTML and encode the interpretative part as class
> > attributes or as RDFa or microformats. TEI could become an *abstract*
> > set of names defining textual properties, without reference to any
> > specific technology. One way of recording and expressing those
> > properties could be via HTML. If we did that then everyone's files would
> > be interoperable because they would already be in the language of the
> > Web.
> >
> > Of course, we can convert the XML to HTML whenever we want, but we don't
> > seek to share it in that form, we seek instead to share the XML, and we
> > can't, because TEI-XML is not interoperable. And yet, there is nothing
> > in TEI-XML that can't be expressed in some alternative way in HTML.
> > Especially since, according to the <a
> > href="http://jtei.revues.org/372">recent survey by Burghart</a>, 97% of
> > TEI-encoded texts of manuscripts (and probably a similar proportion of
> > printed texts) just get converted into HTML anyway. So please explain to
> > me why we need to use XML, because I really don't see it.
> >
> > Desmond Schmidt
> > Queensland University of Technology
>
>
>
> --[2]------------------------------------------------------------------------
>         Date: Fri, 17 Oct 2014 15:36:42 +0000
>         From: Martin Holmes <mholmes at uvic.ca>
>         Subject: RE:  28.404 PostGreSQL and Solr for digital archives
>         In-Reply-To: <20141017064531.B75ED622A at digitalhumanities.org>
>
>
> Hi Desmond,
>
> One obvious reason for encoding initially in TEI (regardless of what you
> end up producing) is that you can use the excellent Guidelines and easily
> create customized schemas for your project which express your theoretical
> approaches and constrain your practice to match them. You could of course
> do this in HTML too, but this:
>
> <span class="tei_persName"><span class="tei_forename">Fred</span> <span
> class="tei_surname">Bloggs</span></span>
>
> is a lot less friendly and human-readable than this:
>
> <persName><forename>Fred</forename> <surname>Bloggs</surname></persName>
>
> Cheers,
> Martin
>
> Martin Holmes
> mholmes at uvic.ca
> martin at mholmes.com
> mholmes at halfbakedsoftware.com





