[Humanist] 28.427 HTML vs XML for TEI -- and TEI Simple

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Thu Oct 23 07:19:08 CEST 2014

                 Humanist Discussion Group, Vol. 28, No. 427.
            Department of Digital Humanities, King's College London
                Submit to: humanist at lists.digitalhumanities.org

        From: Humanist Discussion Group <willard.mccarty at mccarty.org.uk>


  [1]   From:    Ken Kahn <toontalk at gmail.com>                            (867)
        Subject: Re:  28.421 HTML vs XML for TEI

  [2]   From:    Martin Holmes <mholmes at uvic.ca>                           (95)
        Subject: RE:  28.421 HTML vs XML for TEI

  [3]   From:    Martin Mueller <martinmueller at northwestern.edu>           (13)
        Subject: TEI Simple and HTML vs XML for TEI

        Date: Tue, 21 Oct 2014 14:41:32 +0100
        From: Ken Kahn <toontalk at gmail.com>
        Subject: Re:  28.421 HTML vs XML for TEI
        In-Reply-To: <41baf279-455e-4d3a-8d91-cf58208dc423 at HUB03.ad.oak.ox.ac.uk>

Apropos the future of XML, I found it interesting to see Google Trends (http://www.google.com/trends/explore#q=xml), which shows searches containing XML declining steadily to about one-sixth of their level ten years ago.

-ken kahn

        Date: Tue, 21 Oct 2014 15:25:48 +0000
        From: Martin Holmes <mholmes at uvic.ca>
        Subject: RE:  28.421 HTML vs XML for TEI
        In-Reply-To: <20141021051843.538966715 at digitalhumanities.org>

Hi Desmond,

>> That's not to say, of course, that we may not use embedded information
>> describing the source text to render it for a reader in a similar way;
>> but that's not its purpose
> No, as I said before, that is how it is *supposed* to be. My point was
> simply that between the dream and the reality falls the shadow.
> Please explain in what way the TEI Guidelines' example of <zone> is *not*
> embedding end-result related information into the transcription when it
> reads like this:
> <surface ulx="14.54" uly="16.14" lrx="0"
>  lry="0">
>  <graphic url="stone.jpg"/>
>  <zone points="4.6,6.3 5.25,5.85 6.2,6.6 8.19222,7.4125 9.89222,6.5875
> 10.9422,6.1375 11.4422,6.7125 8.21722,8.3125 6.2,7.65"/>
> </surface>
> That's pure formatting dressed up as 'logical' markup.
> I don't believe that TEI-XML has any more or less end-result related
> content than HTML. Not any more.

I think you must be misunderstanding the purpose of surface/zone markup. The idea here is to be able to link areas on images (typically page-images in a facsimile) to other aspects of markup; so, for example, one might define a zone outlining a stanza in a poem, and link that to a transcription of the poem encoded using <lg> and <l>. There are no implications for rendering whatsoever.

As I said before, we may use that information in the process of rendering an online facsimile edition (for example); but all it's actually saying is: here is a shape on the page-image, with an @xml:id.
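A minimal sketch of that sort of linking might look like this (the @xml:id, URL, and coordinate values are invented for illustration):

```xml
<facsimile>
  <surface>
    <graphic url="page1.jpg"/>
    <!-- a shape on the page-image outlining a stanza; nothing here says how to render it -->
    <zone xml:id="zone-st1" points="10,20 200,20 200,120 10,120"/>
  </surface>
</facsimile>
<text>
  <body>
    <!-- @facs links the transcribed stanza back to the zone on the image -->
    <lg type="stanza" facs="#zone-st1">
      <l>First line of the stanza</l>
      <l>Second line of the stanza</l>
    </lg>
  </body>
</text>
```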

>> But HTML5 (for example) has (I think) fewer than 120 elements,
>> whereas TEI has nearly 600, and we deal with a steady flow of requests
>> for more. TEI has much more descriptive power than HTML. And I would
>> claim that it's much easier to use this descriptive power in TEI XML
>> than it would be to hack it into HTML; to do that, you would have to
>> decide whether (for instance) a <nationality> element should be
>> expressed as a <span> or a <div> or a <p> in HTML, and what its
>> distinguishing attribute should be; and in order to make the results
>> truly interoperable, we would all have to agree on those things, and
>> document them, and create a new Guidelines document to formalize them.
>> The result would be more verbose and less human-readable.
> I'm a firm believer in Browning's popular dictum 'less is more'. And less
> *is* more, because those mere 120 elements describe far more documents, of
> far wider variety, than TEI-XML.

They certainly do, but they're not interoperable in any meaningful way other than that a browser can display them. If I encode my poetic lines with HTML <div> elements, you encode yours with <span>s, and a third person delimits them with <br/> tags (all perfectly reasonable choices), our three texts are not interoperable at all.
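To make that concrete, here is a sketch of the same two verse lines as three encoders might legitimately produce them (class names invented); a browser displays all three happily, but no generic tool can recognize them all as verse lines:

```xml
<!-- encoder 1: one block element per line -->
<div class="line">Shall I compare thee to a summer's day?</div>
<div class="line">Thou art more lovely and more temperate:</div>

<!-- encoder 2: inline spans -->
<span class="l">Shall I compare thee to a summer's day?</span>
<span class="l">Thou art more lovely and more temperate:</span>

<!-- encoder 3: plain text with line breaks -->
Shall I compare thee to a summer's day?<br/>
Thou art more lovely and more temperate:<br/>
```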

> As for deciding what HTML element to map an XML tag to you do that all the
> time
> in XSLT, don't you?

Absolutely; and it's a very lossy conversion whose purpose is display.
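A typical mapping of that kind, sketched as an XSLT template (the choice of HTML <div> and the class name are arbitrary display decisions, and most TEI attributes are simply discarded in the process):

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">
  <!-- render each TEI verse line as an HTML block element -->
  <xsl:template match="tei:l">
    <div class="verse-line">
      <xsl:apply-templates/>
    </div>
  </xsl:template>
</xsl:stylesheet>
```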

> I never said HTML was a perfect container, but it is a real, not a de
> facto standard, and augmenting it with RDFa would give you a rich and
> standards-compliant way to record semantic properties of text.
> Yes, you'd have to rewrite the Guidelines as a set of standard
> properties, but it would be less complex by far, not more. Especially as
> in that form, most of the 600 tags would not be needed.

But distinct methods of encoding all the features those tags cover _would_ still be required by the users who need them. The reason those tags and attributes exist is that (for the most part) people need them and use them.
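For comparison, the RDFa suggestion might look something like the following sketch (the vocabulary URI is invented for illustration; no such standard TEI property set exists today):

```xml
<p vocab="http://example.org/tei-terms/">
  Born in Dublin, and
  <span property="nationality">Irish</span>
  by birth, he moved to London in 1876.
</p>
```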

>> Most users, when customizing, don't _extend_ TEI; they actually
>> constrain it further.
> I quote the TEI Guidelines chapter 23.3:
> "the TEI scheme may be extended in well-defined and documented ways
>  for texts that cannot be conveniently or appropriately encoded using what
> is provided. For these reasons, it is almost impossible to use the TEI
> scheme
> without customizing or personalizing it in some way. "

Certainly; and the vast majority of those customizations are in the direction of greater constraint; this allows each project to work with a much smaller subset of TEI which is all they actually need, while still remaining conformant. And this is one reason why teaching people XML encoding these days is a lot easier; the project schema guides almost every decision they make.
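Such a constraining customization is written in ODD; a minimal sketch that pulls in only a few modules and deletes an element the project never uses might look like this (the module and element names are real TEI; the schema name is invented):

```xml
<schemaSpec ident="myProject" start="TEI">
  <!-- include only the modules the project actually needs -->
  <moduleRef key="tei"/>
  <moduleRef key="core"/>
  <moduleRef key="header"/>
  <moduleRef key="textstructure"/>
  <!-- constrain further: remove an element the project never uses -->
  <elementSpec ident="said" module="core" mode="delete"/>
</schemaSpec>
```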

> The 125,000 EEBO-TCP books encoded in a customised version of TEI P3
> were, I gather, not TEI-conformant.

That's before my time, so I don't know why that was the case, but I gather that P5-conformant versions are now available.

>> In all my years of teaching people HTML, TEI and other languages, I've
>> never come across a single person younger than me who has had any
>> problem learning it; nowadays, given all the extra aids we have from
>> tools such as Oxygen, it takes a remarkably short time to get productive
>> with TEI. I've come across a few older people who claim they find it
>> hard, but generally before they've actually made any effort to learn it.
> I think I have enough experience after supervising encoders of primary texts
> for perhaps 20 years in 3 countries. I stand by my remarks. You cannot get
> *consistency* in TEI markup without a great deal of effort. It's just too
> complicated.

If you're not taking full advantage of the customization and schema constraint features offered by ODD and modern schema languages such as RelaxNG and Schematron, then it's substantially more difficult than it needs to be, for sure.
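For instance, a project's house rule ("every verse line must sit inside a line group") can be enforced mechanically with a small Schematron rule, sketched here:

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>
  <pattern>
    <rule context="tei:l">
      <!-- flag stray verse lines that are not wrapped in an <lg> -->
      <assert test="parent::tei:lg">A verse line must appear inside a line group.</assert>
    </rule>
  </pattern>
</schema>
```

Once a rule like this is in the project schema, an editor such as Oxygen flags the violation as the encoder types, which is exactly what makes consistency achievable at scale.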

All the best,
Martin Holmes
mholmes at uvic.ca
martin at mholmes.com
mholmes at halfbakedsoftware.com

        Date: Tue, 21 Oct 2014 20:40:26 +0000
        From: Martin Mueller <martinmueller at northwestern.edu>
        Subject: TEI Simple and HTML vs XML for TEI
        In-Reply-To: <20141021051843.538966715 at digitalhumanities.org>

The exchange between Desmond Schmidt, Hugh Cayless, and Martin Holmes is helpful in articulating long-standing TEI problems, whether real or perceived (and the perceived problems are often the harder ones to deal with). It is also timely, because it highlights problems that the TEI Simple project is trying to address. For convenience's sake I reproduce below an announcement you may have seen before:

Northwestern University is pleased to announce a matching grant from the Andrew W. Mellon Foundation for the development of TEI Simple, which seeks to lower the entry barriers to working with TEI documents by combining a new, highly constrained and prescriptive subset of the Text Encoding Initiative Guidelines with a "cradle to grave" processing model that associates the TEI Simple schema with explicit and standardized options for displaying and querying texts. A major driver for this project has been the imminent release into the public domain of some 25,000 TEI-encoded texts from Early English Books Online (EEBO), but the project aims more broadly at creating a friendlier and more interoperable environment for working with digital surrogates of books in European languages from the early modern period into the 20th century.

The grant of $51,500 matches contributions of $68,000 in time or money from the Centre for Digital Research in the Humanities at the University of Nebraska-Lincoln, the University of Oxford, the TEI Consortium, and Northwestern University.

The principal investigators of the project are Sebastian Rahtz (Oxford), Brian Pytlik Zillig (Nebraska-Lincoln), and Martin Mueller (Northwestern).  The Advisory Committee for TEI Simple includes representatives from the German Text Archive, Text Grid, and the Bodleian Libraries.

The project is scheduled for completion by August 2015. Once all its elements are in place, TEI Simple will be fully integrated into the TEI infrastructure, and the TEI Council will be responsible for its maintenance and further development.

Much of the lively exchange turned on the question of interoperability. I don't have the technical chops to say much of use about that subject, but it has always seemed to me that this is much more a social than a technical question. If I think of the encoding of a text as something that furthers my project and seeks to share my view of the text with readers, we are in the world of "quot homines tot sententiae." That is Latin for "as many opinions as heads," and I quote the Latin to make the point that this insight has been around at least since 200 BCE. If I think of encoding as an act that, in addition to furthering my project, also creates a data set that can be used by others for their purposes, it is a different story. Coarse but consistent encoding across quite heterogeneous data may achieve useful levels of interoperability, by which I mean that I can reuse your data and expect that they more or less follow a common standard. In this regard literary studies may have a lot to learn from the life sciences, where the maintenance of sharable data has brought major benefits, although from the perspective of the individual scientist it brings its own headaches.

Future research projects in the text-centric humanities are likely to involve a lot of "mix and match" approaches, where a doctoral student assembles hundreds, perhaps even thousands, of texts from different archives and wants to take advantage of the query potential that comes with TEI encoding. She wants to spend as little time as possible on the "janitor work" that, according to the New York Times, is the "key hurdle to insights" for Big-Data scientists (http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?module=Search&mabReward=relbias%3As%2C%7B%221%22%3A%22RI%3A11%22%7D). She is always going to spend more time on this than she had planned; that is the nature of the beast. But if readily sharable data are a shared goal, there will be less of that "more".

That said, I remember a life scientist who laughed when the subject of shared data came up and said that most scientists would rather share their toothbrushes than their data.

Martin Mueller
Professor emeritus of English and Classics
Northwestern University
