[Humanist] 28.421 HTML vs XML for TEI

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Tue Oct 21 07:18:43 CEST 2014


                 Humanist Discussion Group, Vol. 28, No. 421.
            Department of Digital Humanities, King's College London
                       www.digitalhumanities.org/humanist
                Submit to: humanist at lists.digitalhumanities.org

  [1]   From:    Desmond Schmidt <desmond.allan.schmidt at gmail.com>        (328)
        Subject: Re:  28.419 HTML vs XML for TEI

  [2]   From:    Desmond Schmidt <desmond.allan.schmidt at gmail.com>        (379)
        Subject: Re:  28.419 HTML vs XML for TEI


--[1]------------------------------------------------------------------------
        Date: Tue, 21 Oct 2014 00:41:33 +1000
        From: Desmond Schmidt <desmond.allan.schmidt at gmail.com>
        Subject: Re:  28.419 HTML vs XML for TEI
        In-Reply-To: <20141020060521.022C9660A at digitalhumanities.org>


Hi Martin,

I'll reply to your email separately to keep this short.

> We do not (or we should not) use any of the current
> TEI mechanisms for embedding information descriptive of typographical
> features, layout, etc. in a text for embedding output or processing
> information. If you know of any such examples in the Guidelines, please
> report them as a bug.
>
> That's not to say, of course, that we may not use embedded information
> describing the source text to render it for a reader in a similar way;
> but that's not its purpose.
>

No, as I said before that is how it is *supposed* to be. My point was
simply that between the dream and the reality falls the shadow.
Please explain in what way the TEI Guidelines' example of <zone> is *not*
embedding end-result related information into the transcription when it
reads like this:

<surface ulx="14.54" uly="16.14" lrx="0"
 lry="0">
 <graphic url="stone.jpg"/>
 <zone points="4.6,6.3 5.25,5.85 6.2,6.6 8.19222,7.4125 9.89222,6.5875
10.9422,6.1375 11.4422,6.7125 8.21722,8.3125 6.2,7.65"/>
</surface>

That's pure formatting dressed up as 'logical' markup.
I don't believe that TEI-XML has any more or less end-result related
content than HTML. Not any more.

> But HTML5 (for example) has (I think) fewer than 120 elements,
> whereas TEI has nearly 600, and we deal with a steady flow of requests
> for more. TEI has much more descriptive power than HTML. And I would
> claim that it's much easier to use this descriptive power in TEI XML
> than it would be to hack it into HTML; to do that, you would have to
> decide whether (for instance) a <nationality> element should be
> expressed as a <span> or a <div> or a <p> in HTML, and what its
> distinguishing attribute should be; and in order to make the results
> truly interoperable, we would all have to agree on those things, and
> document them, and create a new Guidelines document to formalize them.
> The result would be more verbose and less human-readable.

I'm a firm believer in Browning's popular dictum 'less is more'. And less
*is* more, because those mere 120 elements describe far more documents of
far wider variety than TEI-XML.
As for deciding what HTML element to map an XML tag to, you do that all the
time in XSLT, don't you? I never said HTML was a perfect container, but it
is a real, not a de facto, standard, and augmenting it with RDFa would give
you a rich and standards-compliant way to record semantic properties of text.
Yes, you'd have to rewrite the Guidelines as a set of standard properties,
but it would be less complex by far, not more. Especially as in that form,
most of the 600 tags would not be needed.
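
As a minimal sketch of what that might look like for the <nationality>
case, assuming a standard property list existed: the "teiprop" prefix, its
example.org URI and the property names below are hypothetical, not an
existing TEI vocabulary. Only the RDFa attributes (prefix, typeof,
property) are real.

```xml
<!-- TEI-XML equivalent: <nationality>English</nationality> inside <person> -->
<div xmlns="http://www.w3.org/1999/xhtml"
     prefix="teiprop: http://example.org/tei-properties#"
     typeof="teiprop:person">
  <!-- Plain HTML spans carry the text; RDFa attributes carry the semantics -->
  <span property="teiprop:persName">Robert Browning</span>,
  <span property="teiprop:nationality">English</span>
</div>
```

A browser would render this as ordinary inline text, while an RDFa-aware
tool could read off the two properties.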

> Most users, when customizing, don't _extend_ TEI; they actually
> constrain it further.
>

I quote the TEI Guidelines chapter 23.3:

"the TEI scheme may be extended in well-defined and documented ways
for texts that cannot be conveniently or appropriately encoded using what
is provided. For these reasons, it is almost impossible to use the TEI
scheme without customizing or personalizing it in some way."

The 125,000 EEBO-TCP books encoded in a customised version of TEI P3
were, I gather, not TEI-conformant.

> In all my years of teaching people HTML, TEI and other languages, I've
> never come across a single person younger than me who has had any
> problem learning it; nowadays, given all the extra aids we have from
> tools such as Oxygen, it takes a remarkably short time to get productive
> with TEI. I've come across a few older people who claim they find it
> hard, but generally before they've actually made any effort to learn it.
>

I think I have enough experience after supervising encoders of primary texts
for perhaps 20 years in 3 countries. I stand by my remarks. You cannot get
*consistency* in TEI markup without a great deal of effort. It's just too
complicated.

Desmond Schmidt
Queensland University of Technology

On Mon, Oct 20, 2014 at 4:05 PM, Humanist Discussion Group <
willard.mccarty at mccarty.org.uk> wrote:

>
>                  Humanist Discussion Group, Vol. 28, No. 419.
>             Department of Digital Humanities, King's College London
>                        www.digitalhumanities.org/humanist
>                 Submit to: humanist at lists.digitalhumanities.org
>
>   [1]   From:    Martin Holmes <mholmes at uvic.ca>                       (116)
>         Subject: Re: [Humanist] 28.414 HTML vs XML for TEI
>
>   [2]   From:    Hugh Cayless <philomousos at gmail.com>                  (104)
>         Subject: Re:  28.414 HTML vs XML for TEI
>
>
>
> --[1]------------------------------------------------------------------------
>         Date: Sun, 19 Oct 2014 05:56:50 -0700
>         From: Martin Holmes <mholmes at uvic.ca>
>         Subject: Re: [Humanist] 28.414 HTML vs XML for TEI
>         In-Reply-To: <20141019064359.8F581612A at digitalhumanities.org>
>
>
> Hi Desmond,
>
> On 14-10-18 11:43 PM, Humanist Discussion Group wrote:
> >
> > Dear Hugh and Martin,
> >
> > Please forgive me for attempting to pick apart your arguments, but I
> > think that readers of Humanist have a right to have the facts unravelled,
> > and judge for themselves.
> >
> > 1. HTML is presentation-oriented whereas XML describes logical structure.
> >
> > Direct formatting of elements in modern HTML has long been deprecated.
> > CSS formatting cleanly separates rendition from textual structure in
> > ways that XML cannot match. Of course, XML is *supposed* to store the
> > clean logical structure of the source documents, but as I have been
> > informed on numerous occasions when reviewing other people's TEI-XML,
> > embedding end-result related information directly into the TEI source is
> > now common and accepted practice. Just take a look at the examples
> > in the Guidelines for <surface> and <zone> some time.
>
> I think it's important to distinguish between detailed description of
> what a source text looks like (for which we use CSS all the time,
> because it's a great tool for the job), and information intended for the
> rendering of a text. We do not (or we should not) use any of the current
> TEI mechanisms for embedding information descriptive of typographical
> features, layout, etc. in a text for embedding output or processing
> information. If you know of any such examples in the Guidelines, please
> report them as a bug.
>
> That's not to say, of course, that we may not use embedded information
> describing the source text to render it for a reader in a similar way;
> but that's not its purpose.
>
> > HTML *started out* as an explicit presentation format, but it has become
> > the lingua franca of mixed content on the Web, and has been extended
> > with powerful mechanisms for expressing trillions of documents of all
> > kinds. I find it hard to believe that it could not also express the
> > documents that digital humanists seek to record. Are we so different?
>
> No. But HTML5 (for example) has (I think) fewer than 120 elements,
> whereas TEI has nearly 600, and we deal with a steady flow of requests
> for more. TEI has much more descriptive power than HTML. And I would
> claim that it's much easier to use this descriptive power in TEI XML
> than it would be to hack it into HTML; to do that, you would have to
> decide whether (for instance) a <nationality> element should be
> expressed as a <span> or a <div> or a <p> in HTML, and what its
> distinguishing attribute should be; and in order to make the results
> truly interoperable, we would all have to agree on those things, and
> document them, and create a new Guidelines document to formalize them.
> The result would be more verbose and less human-readable.
>
> > 2. Using HTML instead of XML would solve no problems
> >
> > Indeed it does solve at least one big problem: it would make TEI-encoded
> > texts interoperable across thousands of applications that already
> > understand HTML. At the moment, given the immense variation in the
> > selection and application of TEI tags, not only is interoperability
> > impossible, but even interchange (that is, lossy conversion in order to
> > re-use a document) is difficult without prior agreement. How are we
> > supposed to work together when the language through which we communicate
> > actually impedes collaboration?
>
> I don't think it would, for the reasons I mention above. It might make
> it more immediately renderable in web browsers; but a simple CSS
> stylesheet can do that for TEI anyway.
>
> > 3. HTML is less stable than TEI-XML
> >
> > HTML is in its fifth standardised definition in 22 years. In that time
> > TEI has also had five major revisions, but numerous incremental changes. P5
> > has gone through *23* revisions since 2007. Furthermore, HTML is strictly
> > standardised by the ISO and W3C. Software vendors are at a disadvantage
> > if they attempt to deviate from the standard. On the other hand, users of
> > TEI are positively encouraged to customise and extend TEI to suit their
> > needs.
>
> Most users, when customizing, don't _extend_ TEI; they actually
> constrain it further. I don't have a single project in which I've
> introduced anything new into my TEI schema (and I have a lot of TEI
> projects); all my TEI files should validate against tei_all (the
> comprehensive TEI schema). We also make every effort to avoid breaking
> backwards compatibility in any updates to TEI; we have very rarely done
> it, and in all the cases I can think of since I've been on the TEI
> Council, only after determining that the change will not materially
> affect more than a handful of existing users (or no-one at all).
>
> > 4. XML is a better archiving format than HTML
> >
> > This follows naturally from point 3: the more stable a format the better
> > it is for archiving.
>
> It's very good practice to generate as many different output formats as
> you can for reasons of archiving, survivability and
> interoperability/interchange. TEI is a good one, because it comes with
> built-in detailed documentation (if you've done a good job with your ODD
> file). But HTML is another good one, for different purposes. It's easier
> to produce HTML from TEI than the reverse.
>
> > 5. TEI is easier to type than HTML
> >
> > The exact format of the TEI part is not yet decided, so examples that
> > compare verbosity are not feasible.
> >
> > Also, many people understand HTML already, but have to be trained to
> > use TEI-XML. In my experience of supervising such work, encoders make
> > many mistakes that take years to unlearn, and have to be constantly
> > corrected. As a result, keeping the text consistent, even within a single
> > project, is extremely difficult.
>
> In all my years of teaching people HTML, TEI and other languages, I've
> never come across a single person younger than me who has had any
> problem learning it; nowadays, given all the extra aids we have from
> tools such as Oxygen, it takes a remarkably short time to get productive
> with TEI. I've come across a few older people who claim they find it
> hard, but generally before they've actually made any effort to learn it.
>
> > As Hugh admits:
> >
> >> It may now, given the current state of the technology, be possible to
> >> sensibly express TEI in HTML
> >
> > Indeed. In that case I ask again, why don't we do it, and all talk to
> > each other in the language of the Web?
>
> Because we'd have to rewrite the TEI Guidelines in HTML in order to
> ensure we're all using the same expressions in HTML for each of the
> things that are already well-described in TEI; and the result would be
> harder to maintain and transform into other things.
>
> Cheers,
> Martin
>
>
>
>
> --[2]------------------------------------------------------------------------
>         Date: Sun, 19 Oct 2014 14:03:43 -0400
>         From: Hugh Cayless <philomousos at gmail.com>
>         Subject: Re:  28.414 HTML vs XML for TEI
>         In-Reply-To: <20141019064359.8F581612A at digitalhumanities.org>
>
>
> Dear Desmond,
>
> Your "unravelling" seems to me more like erecting strawmen that vaguely
> resemble Martin’s and my responses, which you can then easily knock over.
> This is not arguing in good faith. So while I’m happy and interested to
> discuss these things, I’m not prepared to do so under those conditions. If
> you want to have a serious discussion, I’m up for it, but if not, this will
> probably be my last word on the subject.
>
> 1. CSS, wonderful as it is, does not transform HTML automatically into a
> semantic markup language. As proofs, I give you HTML’s seven (!) levels of
> header, its <p> tag that isn’t actually a paragraph, and its ordered vs.
> unordered (!) lists. I will buy you a beer if you can explain to me how to
> make a list that isn’t ordered. HTML has taken some steps in a semantic
> direction, certainly, but its basic nature hasn’t changed.
>
> That said, I do think an HTML flavor of TEI is possible, and in fact steps
> towards this are being taken. It is not a simple thing, however. And I
> suspect it won’t look very much like typical HTML.
>
> 2. What do you mean by "interoperable"? I’ll simply quote my previous
> point, as you didn’t address it:
>
> >> "Interchange" and "interoperable" are superficially simple concepts, but
> >> the reality is very different. Interchange might mean many different
> >> things in different contexts. Adhering to common standards such as TEI
> >> and XML
> >> makes interchange *possible*, but nothing is going to make it
> >> plug-and-play.
>
> I will add that texts are complex things, and what you want to do with
> yours might be quite different from what I want to do with mine. What does
> it mean for a Shakespeare play and a documentary papyrus to be
> "interoperable"? I have no idea.
>
> 3. Standards are lovely. Have you ever *seen* HTML in the wild? It’s tag
> soup. As I said previously, HTML has very few constraints. Yes, you can
> validate it, but hardly anyone does. What passes for validation is, "Does
> it look right in a browser? Yes? Ship it!" TEI XML has many more
> constraints than HTML, and I will repeat my point that this makes it easier
> to work with during its creation, for anyone wanting to re-use it, and from
> the perspective of digital preservation. In order to get the same
> affordances in HTML, we’d have to do a *huge* amount of work, and it might
> not be possible at all, because progress in HTML, for all that it’s a
> standard, is driven by big companies, not us poor digital humanists.
>
> 4. Sorry, does not follow, because your point #3 does not stand up to
> scrutiny.
>
> 5. I’m not really swayed that much by arguments about verbosity, because
> there are usually technological solutions to them, for example editors with
> autosuggest features or domain-specific markup languages with a minimal and
> restricted syntax which can be expanded to the standard serialization. But:
> the vast majority of people who create HTML content are doing so in a
> WYSIWYG editor or with Markdown vel sim. They don’t actually know HTML.
> What makes you think TEI-equivalent HTML would be any easier for them to
> learn than TEI XML? Martin is quite right that as of now, editing TEI XML
> is much easier than it would be to edit HTML with equivalent semantics. If
> your TEI encoders are making certain mistakes consistently, it’s very easy
> to write Schematron rules that will flag those and make the document
> invalid until they are corrected. This kind of workflow isn’t hard to
> manage in TEI, which is one reason people use it.
>
> In short, XML has lots of advantages that currently make it better suited
> to the creation and publication of TEI texts than HTML. That might change
> in the future. TEI will, if it survives, not always be expressed only in
> XML. But you’re arguing as if a switch to HTML would be easy, and it simply
> ain’t so.
>
> All the best,
> Hugh
>
> _______________________________________________
> Unsubscribe at:
> http://www.dhhumanist.org/Restricted/listmember_interface.php
> List posts to: humanist at lists.digitalhumanities.org
> List info and archives at: http://digitalhumanities.org/humanist
> Listmember interface at:
> http://digitalhumanities.org/humanist/Restricted/listmember_interface.php
> Subscribe at:
> http://www.digitalhumanities.org/humanist/membership_form.php



--[2]------------------------------------------------------------------------
        Date: Tue, 21 Oct 2014 07:39:30 +1000
        From: Desmond Schmidt <desmond.allan.schmidt at gmail.com>
        Subject: Re:  28.419 HTML vs XML for TEI
        In-Reply-To: <20141020060521.022C9660A at digitalhumanities.org>


Hi Hugh,

> Your "unravelling" seems to me more like erecting strawmen that vaguely
> resemble Martin’s and my responses, which you can then easily knock over.
> This is not arguing in good faith.

OK, I'll use quoted sections instead of headings, but it's a lot more
verbose. But don't be surprised if I misunderstand what you say at times.
We all do that.

> So while I’m happy and interested to discuss these things, I’m not prepared
> to do so under those conditions. If you want to have a serious discussion,
> I’m up for it, but if not, this will probably be my last word on the
> subject.
>

That suits me. I always like to get in the last word ;-)

>
> 1. CSS, wonderful as it is, does not transform HTML automatically into a
> semantic markup language. As proofs, I give you HTML’s seven (!) levels of
> header, its <p> tag that isn’t actually a paragraph, and its ordered vs.
> unordered (!) lists. I will buy you a beer if you can explain to me how to
> make a list that isn’t ordered. HTML has taken some steps in a semantic
> direction, certainly, but its basic nature hasn’t changed.
>
> That said, I do think an HTML flavor of TEI is possible, and in fact steps
> towards this are being taken. It is not a simple thing, however. And I
> suspect it won’t look very much like typical HTML.
>

I never said CSS would turn HTML into a semantic language, I said something
like RDFa would. I don't think the TEI should make any changes to HTML. All
they have to do is specify a list of standard properties, gleaned from the
Guidelines as they now stand, that looks like a longer list of abstract
Dublin Core properties. I don't think digital humanists should be in the
business of specifying their own markup languages. HTML will be the
coat-hanger on which the semantic information will be hung. Since 97% of
documentary TEI texts (apparently) get converted into HTML anyway, I don't
think we'll find it too hard to do that conversion given that the necessary
XSLT stylesheets will already exist.
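
To illustrate, here is a rough sketch of how such a stylesheet might carry
TEI element names over as RDFa properties instead of discarding them. It
handles only <person>, <persName> and <nationality>; the "teiprop" prefix
and its example.org URI are hypothetical placeholders for a property list
that does not yet exist. The TEI and XSLT namespaces are the real ones.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    xmlns="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="tei">

  <!-- Container element: declares the (hypothetical) property vocabulary -->
  <xsl:template match="tei:person">
    <div prefix="teiprop: http://example.org/tei-properties#"
         typeof="teiprop:person">
      <xsl:apply-templates/>
    </div>
  </xsl:template>

  <!-- Phrase-level elements: the TEI element name becomes the property name -->
  <xsl:template match="tei:persName | tei:nationality">
    <span property="teiprop:{local-name()}">
      <xsl:apply-templates/>
    </span>
  </xsl:template>

</xsl:stylesheet>
```

The design choice is that HTML supplies only generic containers (div,
span), and everything TEI-specific moves into the property values.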

> 2. What do you mean by "interoperable"? I’ll simply quote my previous
> point, as you didn’t address it:
>

Here's my definition from my TEI Journal paper volume 7 (soon to come out,
unless Martin changes his mind after this):

"Interoperability may be defined as the property of data that allows it to
be loaded unmodified and fully used in a variety of software applications.
Interchange is basically the same property that applies after a preliminary
conversion of the data (Bauman 2011; Unsworth 2011), and implies some loss
of information in the process. Interchange can thus be seen as an easier,
less stringent or less useful kind of information exchange than pure
interoperability."

As an example, SVG is an interoperable XML format. TEI-XML is not. As I
said in the initial post, I think it is unrealistic to expect that we will
ever interoperate at the level of TEI tags. Although this was set as a
distant goal in P3 and P4:

"4. The guidelines should propose sets of coding conventions suited for
various applications.... consensus on suitable conventions for different
applications proved elusive; this remains a goal for future work."

In P5 this was changed to:

"the TEI Guidelines define a general-purpose encoding scheme which makes it
possible to encode different views of text, possibly intended for different
applications, ... no predefined encoding scheme can possibly serve all
research purposes".

I take this to mean that they gave up, because it really *is* impossible.

>
> I will add that texts are complex things, and what you want to do with
> yours might be quite different from what I want to do with mine. What does
> it mean for a Shakespeare play and a documentary papyrus to be
> "interoperable"? I have no idea.
>

We can interoperate at the level of HTML. That will allow us to share our
texts and they will contain all the data that they do now. And semantic Web
applications can be used to process the RDFa. As it stands we have to
write a custom application every time we want to access someone else's
TEI-XML.

> 3. Standards are lovely. Have you ever *seen* HTML in the wild? It’s tag
> soup.

Yes, but that's mostly the fault of CMSes. For those who write their own
HTML I think we can do better. Most TEI-XML files I see produced contain
glaring mistakes, too, like using white space *within* elements to
pretty-print the XML. The commonest mistake of all is using elements for
the wrong purpose, because the transcriber doesn't know what is in the
Guidelines: they are too big.

> As I said previously, HTML has very few constraints. Yes, you can validate
> it, but hardly anyone does. What passes for validation is, "Does it look
> right in a browser? Yes? Ship it!" TEI XML has many more constraints than
> HTML, and I will repeat my point that this makes it easier to work with
> during its creation, for anyone wanting to re-use it, and from the
> perspective of digital preservation.

Syntax checking and deep structure are features that serve the needs of
XML, not the user. Last time I looked, Shakespeare didn't have any
angle-bracketed tags in it. It is just black marks on a page. The XML
surrogate is in this respect a figment of the transcriber's imagination. So
I don't see what the constraints produce, other than enabling the text to
be processed in XML applications. That's self-justifying.

> In order to get the same affordances in HTML, we’d have to do a *huge*
> amount of work, and it might not be possible at all, because progress in
> HTML, for all that it’s a standard, is driven by big companies, not us poor
> digital humanists.
>

Excuse me, but didn't Microsoft have a big hand in the development of XML?
They wanted it for Web application development, or at least that's what
James Clark, the lead technical developer of XML, says
(http://blog.jclark.com/2010/11/xml-vs-web_24.html). I wasn't suggesting
replacing *TEI* with HTML, but XML with HTML. The TEI part would remain.

> 5. I’m not really swayed that much by arguments about verbosity, because
> there are usually technological solutions to them,

Neither am I. But building *general* TEI-capable editors that hide all the
markup doesn't work, because there are lots of features of TEI that don't
translate directly into formatting. So you end up editing the XML directly,
and that's bad from an HCI point of view, because it increases memory load
dramatically. Yes, there are some good specific TEI-based editors, but they
don't consume general TEI that anyone produces. But with HTML+RDFa that
would be possible, because the coathanger of HTML could be rendered in
WYSIWYG form. And the RDFa... I don't know. Since it is simple and a
standard I think something general could be devised.

> TEI will, if it survives, not always be expressed only in XML. But you’re
> arguing as if a switch to HTML would be easy, and it simply ain’t so.
>

I think the survival of TEI is threatened by two things: the biggest threat
is not adapting it to changes in technology, and that's not just whether
XML is fading away, as many now claim. It's also the structure of a TEI
file. It's monolithic, it has annotations and metadata and versions all
bundled up in the one overloaded format. They can and should be separated
out. And secondly it's just too big to comprehend. I could give you a list
of bizarre things about TEI which indicate that it needs a thorough
clean-out, restructuring and revision (see TEI "Howlers":
http://digitalvariants.blogspot.com.au/2014/01/decoupling-tei-from-xml.html).

I feel I agree with you on quite a few points, so don't take these attacks
too hard. I *am* trying to be constructive.

Desmond Schmidt
Queensland University of Technology.

On Mon, Oct 20, 2014 at 4:05 PM, Humanist Discussion Group <
willard.mccarty at mccarty.org.uk> wrote:

>
>                  Humanist Discussion Group, Vol. 28, No. 419.
>             Department of Digital Humanities, King's College London
>                        www.digitalhumanities.org/humanist
>                 Submit to: humanist at lists.digitalhumanities.org
>
>   [1]   From:    Martin Holmes <mholmes at uvic.ca>
> (116)
>         Subject: Re: [Humanist] 28.414 HTML vs XML for TEI
>
>   [2]   From:    Hugh Cayless <philomousos at gmail.com>
>  (104)
>         Subject: Re:  28.414 HTML vs XML for TEI
>
>
>
> --[1]------------------------------------------------------------------------
>         Date: Sun, 19 Oct 2014 05:56:50 -0700
>         From: Martin Holmes <mholmes at uvic.ca>
>         Subject: Re: [Humanist] 28.414 HTML vs XML for TEI
>         In-Reply-To: <20141019064359.8F581612A at digitalhumanities.org>
>
>
> Hi Desmond,
>
> On 14-10-18 11:43 PM, Humanist Discussion Group wrote:
> >
> > Dear Hugh and Martin,
> >
> > Please forgive me for attempting to pick apart your arguments, but I
> > think that readers of Humanist have a right to have the facts unravelled,
> > and judge for themselves.
> >
> > 1. HTML is presentation-oriented whereas XML describes logical structure.
> >
> > Direct formatting of elements in modern HTML has long been deprecated.
> > CSS formatting cleanly separates rendition from textual structure in
> > ways that XML cannot match. Of course, XML is *supposed* to store the
> > clean logical structure of the source documents, but as I have been
> > informed on numerous occasions when reviewing other people's TEI-XML,
> > embedding end-result related information directly into the TEI source is
> > now common and accepted practice. Just take a look at the examples
> > in the Guidelines for <surface> and <zone> some time.
>
> I think it's important to distinguish between detailed description of
> what a source text looks like (for which we use CSS all the time,
> because it's a great tool for the job), and information intended for the
> rendering of a text. We do not (or we should not) use any of the current
> TEI mechanisms for embedding information descriptive of typographical
> features, layout, etc. in a text for embedding output or processing
> information. If you know of any such examples in the Guidelines, please
> report them as a bug.
>
> That's not to say, of course, that we may not use embedded information
> describing the source text to render it for a reader in a similar way;
> but that's not its purpose.
>
> > HTML *started out* as an explicit presentation format, but it has become
> > the lingua franca of mixed content on the Web, and has been extended
> > with powerful mechanisms for expressing trillions of documents of all
> > kinds. I find it hard to believe that it could not also express the
> > documents that digital humanists seek to record. Are we so different?
>
> No. But HTML5 (for example) has (I think) fewer than 120 elements,
> whereas TEI has nearly 600, and we deal with a steady flow of requests
> for more. TEI has much more descriptive power than HTML. And I would
> claim that it's much easier to use this descriptive power in TEI XML
> than it would be to hack it into HTML; to do that, you would have to
> decide whether (for instance) a <nationality> element should be
> expressed as a <span> or a <div> or a <p> in HTML, and what its
> distinguishing attribute should be; and in order to make the results
> truly interoperable, we would all have to agree on those things, and
> document them, and create a new Guidelines document to formalize them.
> The result would be more verbose and less human-readable.
>
> > 2. Using HTML instead of XML would solve no problems
> >
> > Indeed it does solve at least one big problem: it would make TEI-encoded
> > texts interoperable across thousands of applications that already
> > understand HTML. At the moment, given the immense variation in the
> > selection and application of TEI tags, not only is interoperability
> > impossible, but even interchange (that is, lossy conversion in order to
> > re-use a document) is difficult without prior agreement. How are we
> > supposed to work together when the language through which we communicate
> > actually impedes collaboration?
>
> I don't think it would, for the reasons I mention above. It might make
> it more immediately renderable in web browsers; but a simple CSS
> stylesheet can do that for TEI anyway.
>
> > 3. HTML is less stable than TEI-XML
> >
> > HTML is in its fifth standardised definition in 22 years. In that time
> TEI
> > has also had five major revisions, but numerous incremental changes. P5
> > has gone through *23* revisions since 2007. Furthermore, HTML is strictly
> > standardised by the ISO and W3C. Software vendors are at a disadvantage
> > if they attempt to deviate from the standard. On the other hand, users of
> > TEI are positively encouraged to customise and extend TEI to suit their
> > needs.
>
> Most users, when customizing, don't _extend_ TEI; they actually
> constrain it further. I don't have a single project in which I've
> introduced anything new into my TEI schema (and I have a lot of TEI
> projects); all my TEI files should validate against tei_all (the
> comprehensive TEI schema). We also make every effort to avoid breaking
> backwards compatibility in any updates to TEI; we have very rarely done
> it, and in all the cases I can think of since I've been on the TEI
> Council, only after determining that the change will not materially
> affect more than a handful of existing users (or no-one at all).
>
> > 4. XML is a better archiving format than HTML
> >
> > This follows naturally from point 3: the more stable a format the better
> > it is for archiving.
>
> It's very good practice to generate as many different output formats as
> you can for reasons of archiving, survivability and
> interoperability/interchange. TEI is a good one, because it comes with
> built-in detailed documentation (if you've done a good job with your ODD
> file). But HTML is another good one, for different purposes. It's easier
> to produce HTML from TEI than the reverse.
>
> > 5. TEI is easier to type than HTML
> >
> > The exact format of the TEI part is not yet decided, so examples that
> > compare verbosity are not feasible.
> >
> > Also, many people understand HTML already, but have to be trained to
> > use TEI-XML. In my experience of supervising such work, encoders make
> > many mistakes that take years to unlearn, and have to be constantly
> > corrected. As a result, keeping the text consistent, even within a single
> > project, is extremely difficult.
>
> In all my years of teaching people HTML, TEI and other languages, I've
> never come across a single person younger than me who has had any
> problem learning it; nowadays, given all the extra aids we have from
> tools such as Oxygen, it takes a remarkably short time to get productive
> with TEI. I've come across a few older people who claim they find it
> hard, but generally before they've actually made any effort to learn it.
>
> > As Hugh admits:
> >
> >> It may now, given the current state of the technology, be possible to
> >> sensibly express TEI in HTML
> >
> > Indeed. In that case I ask again, why don't we do it, and all talk to
> > each other in the language of the Web?
>
> Because we'd have to rewrite the TEI Guidelines in HTML in order to
> ensure we're all using the same expressions in HTML for each of the
> things that are already well-described in TEI; and the result would be
> harder to maintain and transform into other things.
>
> Cheers,
> Martin
>
>
>
>
> --[2]------------------------------------------------------------------------
>         Date: Sun, 19 Oct 2014 14:03:43 -0400
>         From: Hugh Cayless <philomousos at gmail.com>
>         Subject: Re:  28.414 HTML vs XML for TEI
>         In-Reply-To: <20141019064359.8F581612A at digitalhumanities.org>
>
>
> Dear Desmond,
>
> Your "unravelling" seems to me more like erecting strawmen that vaguely
> resemble Martin’s and my responses, which you can then easily knock over.
> This is not arguing in good faith. So while I’m happy and interested to
> discuss these things, I’m not prepared to do so under those conditions. If
> you want to have a serious discussion, I’m up for it, but if not, this will
> probably be my last word on the subject.
>
> 1. CSS, wonderful as it is, does not transform HTML automatically into a
> semantic markup language. As proofs, I give you HTML’s six (!) levels of
> header, its <p> tag that isn’t actually a paragraph, and its ordered vs.
> unordered (!) lists. I will buy you a beer if you can explain to me how to
> make a list that isn’t ordered. HTML has taken some steps in a semantic
> direction, certainly, but its basic nature hasn’t changed.
>
> That said, I do think an HTML flavor of TEI is possible, and in fact steps
> towards this are being taken. It is not a simple thing, however. And I
> suspect it won’t look very much like typical HTML.
>
> 2. What do you mean by "interoperable"? I’ll simply quote my previous
> point, as you didn’t address it:
>
> >> "Interchange" and "interoperable" are superficially simple concepts, but
> >> the reality is very different. Interchange might mean many different
> >> things in different contexts. Adhering to common standards such as TEI
> >> and XML makes interchange *possible*, but nothing is going to make it
> >> plug-and-play.
>
> I will add that texts are complex things, and what you want to do with
> yours might be quite different from what I want to do with mine. What does
> it mean for a Shakespeare play and a documentary papyrus to be
> "interoperable"? I have no idea.
>
> 3. Standards are lovely. Have you ever *seen* HTML in the wild? It’s tag
> soup. As I said previously, HTML has very few constraints. Yes, you can
> validate it, but hardly anyone does. What passes for validation is, "Does
> it look right in a browser? Yes? Ship it!" TEI XML has many more
> constraints than HTML, and I will repeat my point that this makes it easier
> to work with during its creation, for anyone wanting to re-use it, and from
> the perspective of digital preservation. In order to get the same
> affordances in HTML, we’d have to do a *huge* amount of work, and it might
> not be possible at all, because progress in HTML, for all that it’s a
> standard, is driven by big companies, not us poor digital humanists.
>
> 4. Sorry, does not follow, because your point #3 does not stand up to
> scrutiny.
>
> 5. I’m not really swayed that much by arguments about verbosity, because
> there are usually technological solutions to them, for example editors with
> autosuggest features or domain-specific markup languages with a minimal and
> restricted syntax which can be expanded to the standard serialization. But:
> the vast majority of people who create HTML content are doing so in a
> WYSIWYG editor or with Markdown vel sim. They don’t actually know HTML.
> What makes you think TEI-equivalent HTML would be any easier for them to
> learn than TEI XML? Martin is quite right that as of now, editing TEI XML
> is much easier than it would be to edit HTML with equivalent semantics. If
> your TEI encoders are making certain mistakes consistently, it’s very easy
> to write Schematron rules that will flag those and make the document
> invalid until they are corrected. This kind of workflow isn’t hard to
> manage in TEI, which is one reason people use it.
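
As an illustration of the kind of automated consistency check Hugh describes, here is a minimal sketch. It is plain Python with the standard library, not Schematron, and the rule it enforces (every <hi> element must carry a rend attribute) is a hypothetical project convention chosen only for the example.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"


def check_hi_has_rend(xml_text):
    """Return one message per <hi> element that lacks a rend attribute.

    An empty list means the (hypothetical) project rule is satisfied.
    """
    root = ET.fromstring(xml_text)
    problems = []
    for hi in root.iter(TEI + "hi"):
        if not hi.get("rend"):
            problems.append("<hi> without @rend: %r" % (hi.text or ""))
    return problems


good = '<p xmlns="http://www.tei-c.org/ns/1.0">A <hi rend="italic">word</hi>.</p>'
bad = '<p xmlns="http://www.tei-c.org/ns/1.0">A <hi>word</hi>.</p>'

for name, doc in (("good", good), ("bad", bad)):
    issues = check_hi_has_rend(doc)
    print(name, "is valid" if not issues else issues)
```

In a real workflow the same rule would more likely be written as a Schematron assertion and run by the XML editor, so that the encoder sees the error while typing rather than in a separate script.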
>
> In short, XML has lots of advantages that currently make it better suited
> to the creation and publication of TEI texts than HTML. That might change
> in the future. TEI will, if it survives, not always be expressed only in
> XML. But you’re arguing as if a switch to HTML would be easy, and it simply
> ain’t so.
>
> All the best,
> Hugh
>
> _______________________________________________
> Unsubscribe at:
> http://www.dhhumanist.org/Restricted/listmember_interface.php
> List posts to: humanist at lists.digitalhumanities.org
> List info and archives at: http://digitalhumanities.org/humanist
> Listmember interface at:
> http://digitalhumanities.org/humanist/Restricted/listmember_interface.php
> Subscribe at:
> http://www.digitalhumanities.org/humanist/membership_form.php




