[Humanist] 28.442 HTML vs XML for TEI

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Mon Oct 27 08:49:45 CET 2014


                 Humanist Discussion Group, Vol. 28, No. 442.
            Department of Digital Humanities, King's College London
                       www.digitalhumanities.org/humanist
                Submit to: humanist at lists.digitalhumanities.org

  [1]   From:    Desmond Schmidt <desmond.allan.schmidt at gmail.com>        (653)
        Subject: Re:  28.437 HTML vs XML for TEI

  [2]   From:    Desmond Schmidt <desmond.allan.schmidt at gmail.com>        (620)
        Subject: Re:  28.437 HTML vs XML for TEI


--[1]------------------------------------------------------------------------
        Date: Mon, 27 Oct 2014 06:20:09 +1000
        From: Desmond Schmidt <desmond.allan.schmidt at gmail.com>
        Subject: Re:  28.437 HTML vs XML for TEI
        In-Reply-To: <20141025081254.61A147829 at digitalhumanities.org>


Hi Martin,

this is going to be a quick email, because I have run out of time to
continue this discussion. So I'll answer all the current points and then
stop. What it has revealed to me is the need to research this topic
further, and that it cannot be resolved by a discussion on Humanist.

> XML will be superseded by something better; but most of us don't believe
> that is close, and in fact, as schema, query and transformation languages
> and tools keep getting better, XML is actually working better for us all
> the time.

It rather depends on who you ask. If you ask someone who has a vested
interest in the continuation of XML then of course they are going to say
these kinds of things. I only go by what James Clark said in the link I
sent earlier (http://blog.jclark.com/2010/11/xml-vs-web_24.html), that XML
is being replaced by JSON as a Web technology. He said that 4 years ago and
this has pretty much happened since.

As to the question of it as a mixed content technology I see it as
dependent on the Web use of XML. I don't want to crystal-ball gaze but for
those of us who have seen big technologies come and go before the pattern
is clear.

"So what's the way forward? I think the Web community has spoken, and it's
clear that what it wants is HTML5, JavaScript and JSON." (Note the absence
of XML.)

And he should know: he invented the name XML, wrote the first XML-parser (I
think), and XSLT. He's Mr XML if anyone is. Humanists need to be insulated
from that kind of change. That is why I was urging that TEI be made
abstract, not bound to changeable technology.

> In HTML, there is no prescribed or recommended way of encoding a line of
> verse; we would all have to make up our own systems

A good point. The question is, does it really matter? What are we going to
do with that information? I'm not sure that semantically knowing that
something is a line is going to help us "understand" anything about the
text. Probably all we want to do is format it.

> TEI is huge, and one of the first things we do when starting a TEI project
> is to further constrain it so that it contains only what we need.

Even when you "constrain it" you're still defining a specific markup
language that hasn't existed before. Once you do that you are required to
write software that can process that language. You might get by with a
generic stylesheet, but for anything more you're binding the encoding
to custom software. So forget about interoperability from that moment on.

> Images are texts; most of us have seen them as texts for a long time.
> Here's an example of an image which is a text:

Yes, every text is also an image. But the difference here is between the
humanistic way of classifying images versus text and the software
engineer's perspective. The TEI Guidelines are software. To make software
work well
requires that every part has a clearly defined role. The markup of <zone>
does not refer to any feature of text. It describes a region in an image.
The reference to the image file itself requires the existence of that file
in the environment. Giving the transcription to someone else without that
file, at that resolution, with that format will break the transcription.
It's not logical markup but highly specific markup that treats XML like a
word-processor.

> Such information should be external to the textual
> > surrogate, not part of it.
>
> I'm not sure I understand this.

If I say <pb n="5"/> that's abstract. The environment in which the textual
surrogate is used can supply a file linked to the "5". It can even be in a
database. It can be any size, any resolution, any format. If I say <pb
facs="0005.jpg"/> suddenly all that generality goes out the window. In the
same way the <zone> element and its <graphic url="graphic.png"/> are not
general specifications. They impose a link between the transcription and its
external environment that makes it brittle, not reusable.
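
To make the <pb n="5"/> point concrete, here is a minimal sketch of an
environment supplying the image for an abstract page number. It is only an
illustration: the sample transcription, the LOCAL_IMAGES table and both
function names are assumptions made up for this example, not part of TEI or
of any existing tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical transcription: page breaks carry only an abstract page number.
TRANSCRIPTION = '<text><pb n="5"/>Text of page five<pb n="6"/>Text of page six</text>'

# Hypothetical environment-side table. It could just as well be a database
# query or a URL template; the transcription does not know or care.
LOCAL_IMAGES = {"5": "scans/0005.jpg", "6": "scans/0006.jpg"}


def page_numbers(xml_text):
    """Return the n values of all <pb/> elements in document order."""
    root = ET.fromstring(xml_text)
    return [pb.get("n") for pb in root.iter("pb")]


def resolve_pages(xml_text, image_lookup):
    """Pair each page number with whatever image the environment supplies.

    Pages the environment has no image for map to None; the transcription
    itself remains valid either way.
    """
    return {n: image_lookup.get(n) for n in page_numbers(xml_text)}


if __name__ == "__main__":
    print(resolve_pages(TRANSCRIPTION, LOCAL_IMAGES))
```

Passing a different lookup table (other formats, other resolutions, or no
images at all) requires no change to the transcription.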

> But any comparison on XML and JSON is really apples and oranges anyway;
> they have different uses and purposes

XML is mostly a Web technology. It's also a mixed-content technology, and
that's where the apples and oranges come in. No one is suggesting that JSON
is a substitute for XML in that role. As a Web technology it's more like
the difference between Granny-Smiths and Russets.

> My main interest in the TEI Simple proposal personally is probably going
> to enrage you even further: it promises to provide a mechanism for formally
> specifying a processing model for a TEI ODD.

Oh no, I'm not enraged at all. I'm smiling. Tying it down to one processing
model defeats the whole purpose of using XML in the first place. But what 
intrigues me is the difference between yourself and Martin Mueller. You 
want more TEI elements and he wants fewer.

> But its abstract model is the embodiment of an ongoing debate in a large
> community about what the salient features, components and aspects of
> "texts" are; and that model (currently) claims that there is something we
> might call a "paragraph", and that it's recommended that you encode it with
> a <p></p> if you're using TEI XML, so we all understand what you mean.
>

It's not abstract. It's specified in XML. A large slice of the Guidelines
is dedicated to formatting the Guidelines themselves. To make it abstract
you'd have to write the abstract specification and then provide
implementations in XML and other formats.

> TEI Tite had a specific audience:

And the audience of TEI Simple is what exactly? 19th century OCRed American
and German books? It should be much more. If it is included in the TEI
Guidelines for all to use, you have to consider the needs of our new
colleagues in South America, Mexico, India, Japan and the rest of Europe. I
thought GO::DH had changed all this.

See you in Sydney!

Desmond Schmidt
Queensland University of Technology



--[2]------------------------------------------------------------------------
        Date: Mon, 27 Oct 2014 11:09:57 +1000
        From: Desmond Schmidt <desmond.allan.schmidt at gmail.com>
        Subject: Re:  28.437 HTML vs XML for TEI
        In-Reply-To: <20141025081254.61A147829 at digitalhumanities.org>


Hi Hugh,

>
> There’s a disconnect between RDF and structured markup that makes me think
> such a mapping would not be trivial, so again, you’re underestimating the
> level of difficulty involved. But leaving that aside, an XML-based workflow
> means a single source document can be used to produce (for example) one or
> more HTML views, indices, documents for indexing in search engines (e.g.
> Solr), print-ready documents, and RDF for Linked Data. The workflow story
> with HTML isn’t so clear, likely because HTML is usually a destination
> format, not a source format. So you’re arguing for doing a huge amount of
> work in order to migrate to a less-usable format. I don’t rule out the
> current state of affairs changing, but it’s what we face now. I believe in
> incremental development, not throwing out working processes in favor of
> theoretical shiny things.
>

On second thoughts I kind of agree with you. Except that I don't see XML
as a "working process". The spin I would put on "shiny new things" is that
we need to work backwards from user needs, not forwards from what we
already have. Technical realisations are a secondary consideration. I
thought that microformats or a basic use of RDFa would do the trick.
Because all we need to say is "this piece of text has this property". I'm
not sure we want to reason from TEI texts, or not in the way envisaged by
RDF.

> interoperability might be a goal of a specific customization of TEI, but
> it’s not something I’d be interested in imposing on TEI as a whole. People
> want to do different things with different kinds of text.

I'm sorry, I don't follow the logic here. A specific customisation of TEI
is by definition not interoperable. People definitely want to collaborate,
not so much to share texts but to build tools that work across texts.
That's the real advantage of interoperability.

> That kind of error checking is completely absent for HTML/RDFa. With
> TEI-in-HTML you’d have about 50 flavors of <span>. How would we keep them
> straight?

I think you're exaggerating a tad. HTML5 has its own syntax
(www.w3.org/TR/html-markup/syntax.html). I'm not suggesting that we deviate
from that.

> I’ve just replaced my supplied tag with something like <span typeof="
> http://www.tei-c.org/ns/1.0#supplied" data-reason="lost">this</span> (and
> incidentally, it could not be so simple if we’re really using RDFa)

What's wrong with <span property="tei:supplied-lost">this</span>? That
validates in http://rdfa.info/play/. I'm not sure you need the typeof
attribute in this example. The question is, what do you want to say about
that property? What does it belong to? Not to itself, as you are saying
here. We can combine "supplied" and "lost" in this case since there will be
no practical limit on the number of properties. Or we can just use
microformats.
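
As a rough illustration of how little machinery the "this piece of text has
this property" reading needs, the sketch below collects (property, text)
pairs from spans like the one above. It assumes plain string matching on the
property attribute and is not an RDFa processor; the class name
PropertyCollector and the helper extract_properties are hypothetical.

```python
from html.parser import HTMLParser


class PropertyCollector(HTMLParser):
    """Collect (property, text) pairs from <span property="..."> elements."""

    def __init__(self):
        super().__init__()
        self.found = []    # completed (property, text) pairs
        self._open = []    # one [property, text] entry per open <span>

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            # property is None for spans that carry no property attribute
            self._open.append([dict(attrs).get("property"), ""])

    def handle_data(self, data):
        # Text belongs to every span that is currently open (handles nesting)
        for entry in self._open:
            entry[1] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._open:
            prop, text = self._open.pop()
            if prop is not None:
                self.found.append((prop, text))


def extract_properties(html_text):
    """Return all (property, text) pairs found in an HTML fragment."""
    parser = PropertyCollector()
    parser.feed(html_text)
    parser.close()
    return parser.found


if __name__ == "__main__":
    sample = '<p>Read <span property="tei:supplied-lost">this</span> now</p>'
    print(extract_properties(sample))
```

A tool built this way works across any text that uses the same property
names, whatever else the surrounding HTML contains.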

> What I’m seeing in your argument is a desire to impose order on an
> ecosystem from the top down. There is always this tension between the need
> to standardize and the need to customize—by the latter I don’t mean
> necessarily to alter the specification itself, but to choose to mark
> certain features of a text and not others. If I understand your arguments,
> you feel TEI provides too much flexibility and would be better expressed in
> a format that is more general and has less expressive scope, but is easy to
> work with, particularly from a web-publishing perspective. I don’t think
> that’s an unreasonable opinion, but I hope you’ll forgive me if I don’t
> share it.

Consider this: TEI as an abstract specification, not bound to any
technology, with all the cruft cleaned out, a specification agreed to by
the full *community* of digital humanists, not just the Americans, British
and a few Europeans, or paid-up members. It would be a specification that
could be realised in three forms: a) XML, b) HTML+microformats/RDFa or
whatever c) plain text+external markup and any other forms or technologies
that come into being in future. That would be useful to a much wider range
of people than at present. And no, I don't think it is a pipe dream. It is
realisable if you try.
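
A minimal sketch of what "one abstract specification, several realisations"
could look like in practice, under heavy assumptions: a single
non-overlapping annotation over a plain-text string, with an illustrative
property name. The Annotation class, the three function names and the
"tei:" prefix in the HTML output are all invented for this example and are
not taken from any existing specification.

```python
from dataclasses import dataclass
from html import escape as html_escape
from xml.sax.saxutils import escape as xml_escape


@dataclass(frozen=True)
class Annotation:
    """Abstract statement: characters start..end of the text have property name."""
    start: int
    end: int
    name: str


def to_xml(text, ann):
    """Realisation (a): the property becomes an XML element."""
    return (xml_escape(text[:ann.start])
            + f"<{ann.name}>" + xml_escape(text[ann.start:ann.end]) + f"</{ann.name}>"
            + xml_escape(text[ann.end:]))


def to_html(text, ann):
    """Realisation (b): the property becomes an attribute on an HTML span."""
    return (html_escape(text[:ann.start])
            + f'<span property="tei:{ann.name}">'
            + html_escape(text[ann.start:ann.end]) + "</span>"
            + html_escape(text[ann.end:]))


def to_standoff(text, ann):
    """Realisation (c): plain text plus external markup addressed by offsets."""
    return text, [{"name": ann.name, "start": ann.start, "end": ann.end}]


if __name__ == "__main__":
    text = "Read this now"
    ann = Annotation(start=5, end=9, name="supplied")
    print(to_xml(text, ann))
    print(to_html(text, ann))
    print(to_standoff(text, ann))
```

The abstract statement lives only in Annotation; each serialisation is a
replaceable function, so a future format would mean one more function rather
than a re-encoding of the text.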

FYI I have an urgent appointment in London that I must prepare for, so I
won't be able to respond to your reply in time. The good news is that
*you* get to have the last say, if you want. I promise I'll read it.

Desmond Schmidt
Queensland University of Technology





