[Humanist] 28.437 HTML vs XML for TEI

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Sat Oct 25 10:12:54 CEST 2014


                 Humanist Discussion Group, Vol. 28, No. 437.
            Department of Digital Humanities, King's College London
                       www.digitalhumanities.org/humanist
                Submit to: humanist at lists.digitalhumanities.org

  [1]   From:    Martin Holmes <mholmes at uvic.ca>                          (114)
        Subject: RE:  28.434 HTML vs XML for TEI -- and TEI Simple

  [2]   From:    Hugh Cayless <philomousos at gmail.com>                     (163)
        Subject: Re:  28.421 HTML vs XML for TEI

  [3]   From:    Karin Dalziel <promo at nirak.net>                          (313)
        Subject: Re:  28.393 PostGreSQL and Solr for digital archives

  [4]   From:    Ed Summers <ehs at pobox.com>                                (14)
        Subject: Re:  28.393 PostGreSQL and Solr for digital archives


--[1]------------------------------------------------------------------------
        Date: Fri, 24 Oct 2014 13:49:16 +0000
        From: Martin Holmes <mholmes at uvic.ca>
        Subject: RE:  28.434 HTML vs XML for TEI -- and TEI Simple
        In-Reply-To: <20141024050949.A4D2F8659 at digitalhumanities.org>

Hi Desmond,

I don't want to monopolize this list with a well-aired debate, so I'll make this my last post on the topic:

> you're ignoring the RDFa part of my proposal, which bears the semantic
> information. I wasn't proposing eliminating anything useful from the TEI
> scheme, just expressing it in an abstract way for use in more modern, and
> future technologies.

If you're not objecting to the more abstract goals of the TEI -- the development and documentation of ontologies for classifying textual features and metadata information related to humanities texts -- then I have no quarrel with that. We currently use XML because it's proved by far the most effective way to encode and manipulate that information. It's perfectly possible to express it all in many different ways, and the TEI expects that at some point in the future, XML will be superseded by something better; but most of us don't believe that day is close, and in fact, as schema, query and transformation languages and tools keep getting better, XML is actually working better for us all the time.

> In your example you reduce to the absurd the variability of legitimate but
> unlikely encodings in HTML for poetic lines.

I'm sure I don't. I've seen all of these particular renderings, and I've actually generated two of the three myself, as part of serious projects.

> In TEI one can play even
> wilder games with the same material, because there are many more tags with
> almost the same meaning, plus looser attribute definitions, to play with:
> 
> <l>I wandered lonely as a cloud,</l>...
> <div type="stanza">I wandered lonely as a cloud,<lb/>...
> <div type="line">I wandered lonely as a cloud,</div>
> <ab type="line">I wandered lonely as a cloud,</ab>
> <seg type="line">I wandered lonely as a cloud,</seg>
> etc.

But the point is that in TEI, the <l> element is specifically documented as being for this purpose ("contains a single, possibly incomplete, line of verse"). Nothing can prevent a user from choosing to encode something in an alternative manner, but they would be departing from the Guidelines in doing that, and they would know that they were doing so (unless they hadn't bothered to consult the Guidelines). In HTML, there is no prescribed or recommended way of encoding a line of verse; we would all have to make up our own systems, or agree on a convention -- which is what the TEI is. Incidentally, Brett Zamir (IIRC) proposed a very straightforward way of serializing TEI as HTML5 using HTML5's custom data attributes; I believe this would be non-lossy and reversible. So if you want your TEI as HTML5, you can do that perfectly well. You could even encode it that way in the first place, but I believe that would be hard work because, unless you created a custom HTML5 schema to help you, you wouldn't have any of the prompts and helpful constraints the TEI XML schemas provide for the encoder.
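
To make that concrete, here is a hypothetical sketch of such a
serialization (the data-* attribute names are invented for illustration,
and not necessarily Zamir's actual scheme): each TEI element becomes an
HTML element that records its TEI name and attributes, so the original
XML could be mechanically reconstructed:

  <!-- TEI source: <lg type="stanza"><l>...</l>...</lg> -->
  <div data-tei-element="lg" data-tei-type="stanza">
    <div data-tei-element="l">I wandered lonely as a cloud,</div>
    <div data-tei-element="l">That floats on high o'er vales and hills,</div>
  </div>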
 
> In TEI an <l> element may contain any one of 196 different types of other
> TEI elements, and may itself be contained by 53 different types of
> elements. I don't see how that is highly constrained as claimed.

That's precisely my point: TEI is huge, and one of the first things we do when starting a TEI project is to further constrain it so that it contains only what we need. We produce schemas and documents which are highly constrained, but which all (typically) conform to "tei_all", the big schema which includes everything.
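
For instance, a project schema typically comes from an ODD customization
along these lines (a minimal sketch; the identifier and the particular
module choices are invented for illustration):

  <schemaSpec ident="myVerseProject" start="TEI">
    <moduleRef key="tei"/>
    <moduleRef key="header"/>
    <moduleRef key="textstructure"/>
    <!-- take the core module, but drop elements this project never uses -->
    <moduleRef key="core" except="said sp speaker stage"/>
    <moduleRef key="verse"/>
  </schemaSpec>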

>> I think you must be misunderstanding the purpose of surface/zone markup.
>> The idea here is to be able to link areas on images (typically page-images
>> in a facsimile) to other aspects of markup; so, for example, one might
>> define a zone outlining a stanza in a poem, and link that to a
>> transcription of the poem encoded using <lg> and <l>. There are no
>> implications for rendering whatsoever.
>>
>> As I said before, we may use that information in the process of rendering
>> an online facsimile edition (for example); but all it's actually saying is:
>> here is a shape on the page-image, with an @xml:id.
>
> In that case I suggest that you rename the TEI Guidelines the TIEI (Text
> and Image encoding) Guidelines, since it now contains markup for images.

Images are texts; most of us have seen them as texts for a long time. Here's an example of an image which is a text:

 http://mariage.uvic.ca/anth_doc.htm?id=la_femme_battant 

 -- not only because it's a text-bearing object, but also because it's laden with symbolism and cultural information. It's encoded in TEI, and annotated and rendered in such a way that (we hope) it's a little more accessible to modern readers.

The TEI encoding also enables us to search the entire text collection as a whole, retrieving not just hits in texts, but also fragments of images like the one above, in the same collection:

 http://mariage.uvic.ca/search.htm?quickSearch=poule 

> You have to draw the line somewhere, and the element in question does not
> describe text.

In the case of the most straightforward use of <zone> (in a digital facsimile encoded in TEI), it links an area on a representation of an original source page to the transcription of the text which appears there, or to editorial commentary on it. 
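
In its simplest form the encoding looks something like this (a minimal
sketch; the identifiers, coordinates and filename are invented):

  <facsimile>
    <surface xml:id="page1">
      <graphic url="page1.jpg"/>
      <!-- a rectangle on the page-image, with an @xml:id -->
      <zone xml:id="z1" ulx="120" uly="80" lrx="540" lry="260"/>
    </surface>
  </facsimile>
  <text>
    <body>
      <!-- the transcription points back at the zone via @facs -->
      <lg type="stanza" facs="#z1">
        <l>I wandered lonely as a cloud,</l>
      </lg>
    </body>
  </text>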

> Such information should be external to the textual
> surrogate, not part of it.

I'm not sure I understand this. It is "external" in that it's part of the <facsimile> component of the TEI file, not the <text>; and it can perfectly well be external in a more literal sense by storing it in a separate file. But there's no reason why it shouldn't be encoded in TEI.

> no I hadn't seen that one, but this one springs to mind also:
> http://www.google.com/trends/explore#q=xml,json
> Perhaps people don't realise how many billions of queries these graphs are
> based on. The decline in XML's popularity is very real.

I would say that people don't have to search for something once they know what it is and have a basic understanding of it. But any comparison of XML and JSON is really apples and oranges anyway; they have different uses and purposes.

> I'm sorry that I don't share your enthusiasm for TEI Simple, as it is
> described. I can only ask what went wrong with TEI-Lite and TEI-Tite and
> DTA-basis format and TextGrid baseline encoding that TEI-Simple is going to
> fix?

My own main interest in the TEI Simple proposal is probably going to enrage you even further: it promises to provide a mechanism for formally specifying a processing model for a TEI ODD. I'm interested in this because a) it looks like it might result in abstract hierarchical class structures for TEI elements, instead of the flat class system we currently have, and that intrigues me; and b) at the moment, I don't think it can actually be done successfully, but far brighter people than I am believe it can, and I want to see how it's going to work out.

> Could I perhaps interest you instead in basing TEI Simple on
> *abstract* properties of text rather than a fixed XML syntax?

I think Chapter 23 of the Guidelines does draw a distinction between the TEI abstract model and its current XML incarnation:

 http://www.tei-c.org/release/doc/tei-p5-doc/en/html/USE.html 

TEI is currently expressed in XML; it was formerly expressed in SGML, and it will probably be expressed in something else in the not-too-distant future. But its abstract model is the embodiment of an ongoing debate in a large community about what the salient features, components and aspects of "texts" are; and that model (currently) claims that there is something we might call a "paragraph", and that it's recommended that you encode it with a <p></p> if you're using TEI XML, so we all understand what you mean.

> Imposing a
> strict syntax even at a coarse grained level will I fear not work, because
> everyone interprets the same codes differently, however simple they are.

They may, but we do make a serious effort to counter that by providing useful guidelines.

> Any attempt to retrieve information from deeply encoded documents which
> have been marked up by humans - exactly as you point out in your quote -
> will have a very poor recall factor, although it will be precise. If your
> textual enquiry is roughly "find all the quotes in all the lines of all the
> stanzas of all the poems by Joe Bloggs", a percentage of the elements
> retrieved will be lost at each level of the hierarchy due to variations in
> the way those elements are encoded, until you may find no such quotes at
> all, even though hundreds of them may exist. This is what the DTA already
> complained about (Geyken et al. 2012). You should also reconsider what
> Patrick Durusau said in Electronic Textual Editing about the loss of
> information that variation in the encoding of even a *single* tag leads to.

I agree completely with this. It's one thing that TEI Simple is intended to help with, by limiting itself to one method of encoding any given feature. But we do not encode merely for interchange and interoperability; we encode for the purposes of our own project. We may then downsample or convert in some slightly lossy way to improve the chances of successful interchange; and I wish we did that better, and more. I'm hoping to talk about that next year at DH in Sydney.

> The expressed goal of TEI Tite was to specify *"exactly one* way of
> encoding a particular feature of a document in as many cases as possible,
> ensuring that any two encoders would produce the same XML document for a
> source document." If it succeeded in that regard, I don't understand the
> need for TEI Simple.

TEI Tite had a specific audience: 

"TEI Tite is a constrained customization of TEI designed for use when outsourcing production of TEI documents to vendors, who use some combination of OCR and keyboarding to produce encoded text."

<http://www.tei-c.org/release/doc/tei-p5-exemplars/html/tei_tite.doc.html#intro>
>> There are no implications for rendering whatsoever.
>>
>> As I said before, we may use that information in the process of rendering
>> an online facsimile edition
>
> I find it impossible to reconcile these two statements.

Why? Is it so strange that I would choose to use some of the information I've encoded _about_ the source text, when I come to render a version of it for a reader? Even copying a Unicode codepoint from the transcription to the rendered page does this, surely?

> If in <zone> there are "no implications for rendering whatsoever" how can
> you then use, even sometimes,"that information in the process of
> rendering"? And when not used for rendering, what is its purpose? Surely
> only to be ignored.

Well, as in the example above, you can use it for searching text collections that contain graphical texts, for instance.

All the best from a lively TEI Conference,
Martin
Martin Holmes
mholmes at uvic.ca
martin at mholmes.com
mholmes at halfbakedsoftware.com


--[2]------------------------------------------------------------------------
        Date: Fri, 24 Oct 2014 10:21:21 -0400
        From: Hugh Cayless <philomousos at gmail.com>
        Subject: Re:  28.421 HTML vs XML for TEI
        In-Reply-To: <20141021051843.538966715 at digitalhumanities.org>

I’m all about constructive :-). 

There’s a disconnect between RDF and structured markup that makes me think such a mapping would not be trivial, so again, you’re underestimating the level of difficulty involved. But leaving that aside, an XML-based workflow means a single source document can be used to produce (for example) one or more HTML views, indices, documents for indexing in search engines (e.g. Solr), print-ready documents, and RDF for Linked Data. The workflow story with HTML isn’t so clear, likely because HTML is usually a destination format, not a source format. So you’re arguing for doing a huge amount of work in order to migrate to a less-usable format. I don’t rule out the current state of affairs changing, but it’s what we face now. I believe in incremental development, not throwing out working processes in favor of theoretical shiny things.

> "Interoperability may be defined as the property of data that allows it to
> be loaded unmodified and fully used in a variety of software applications.
> Interchange is basically the same property that applies after a preliminary
> conversion of the data (Bauman 2011; Unsworth 2011), and implies some loss
> of information in the process. Interchange can thus be seen as an easier,
> less stringent or less useful kind of information exchange than pure
> interoperability."

That’s a fair definition. But I don’t see it as a sensible goal for anything other than very standard TEI. To put it another way, interoperability might be a goal of a specific customization of TEI, but it’s not something I’d be interested in imposing on TEI as a whole. People want to do different things with different kinds of text. 

> We can interoperate at the level of HTML. That will allow us to share our
> texts and they will contain all the data that they do now. And semantic Web
> applications can be used to process the RDFa.  As it stands we have to
> write a custom application every time we want to access someone else's
> TEI-XML.

I’m not seeing any specifics there. So we’ll be able to open each others’ texts in a browser? Fine. I promise you the RDFa (or whatever) data in it won’t be consistent.

> Syntax checking and deep structure are features that serve the needs of
> XML, not the user. Last time I looked, Shakespeare didn't have any
> angle-bracketed tags in it. It is just black marks on a page. The XML
> surrogate is in this respect a figment of the transcriber's imagination. So
> I don't see what the constraints produce, other than enabling the text to
> be processed in XML applications. That's self-justifying.

These constraints help control certain types of error and they give us hooks to hang conventions and documentation on. That kind of error checking is completely absent for HTML/RDFa. With TEI-in-HTML you’d have about 50 flavors of <span>. How would we keep them straight?

> But with HTML+RDFa that
> would be possible, because the coathanger of HTML could be rendered in
> WYSIWYG form. And the RDFa... I don't know. Since it is simple and a
> standard I think something general could be devised.

The HTML would render. Sort of. But where’s the win here? I still don’t see what being able to open my document directly in a web browser is going to gain me beyond being able to view it in a web browser. I’ll give you a very small example: If I have a document with <supplied reason="lost">this</supplied> in it, I will want to render that for a reader as [this], with brackets around it. I could delegate the insertion of those brackets to CSS content, but I don’t want to, because my user won’t be able to copy and paste it. So CSS isn’t enough, I need CSS + Javascript that mutates my document dynamically in the browser. This is perfectly possible to do, but I’ve just replaced my supplied tag with something like <span typeof="http://www.tei-c.org/ns/1.0#supplied" data-reason="lost">this</span> (and incidentally, it could not be so simple if we’re really using RDFa) plus jQuery or something, plus code to grab all my spans with type supplied and reason="lost" and wrap them in square brackets, all of which has to be in, or linked from, my TEI HTML document. And that’s all so it looks ok when someone opens it. The HTML format, improved as it is, still forces me to embed presentation. And I’ll point out again that the presentation view is only one use I might want to make of that document. It’s just not an improvement on any level.
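
Spelled out, the moving parts look something like this (a sketch of the
approach just described, not code from a real project; plain DOM calls
stand in for jQuery):

  <span typeof="http://www.tei-c.org/ns/1.0#supplied"
        data-reason="lost">this</span>
  <script>
    // Find every "supplied" span whose reason is "lost" and mutate the
    // document so the editorial brackets become real, copyable text.
    document
      .querySelectorAll('span[typeof$="#supplied"][data-reason="lost"]')
      .forEach(function (el) {
        el.textContent = '[' + el.textContent + ']';
      });
  </script>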

The discussion has moved on a little bit, so I’ll also address the decline in popularity of XML. In a nutshell, I don’t care even a little bit that XML was once the primary data interchange format and now it isn’t. We’re all better off not being forced to use it for configuration files or to load snippets of data into web pages, and the demise of abominations like SOAP is to be celebrated. So? There are thriving communities using XML for mixed content applications—the same sorts of things SGML was good for back in the day. The volume of use cases is much smaller than the list of things XML was used for at the height of its popularity, but that’s OK. The XML explosion was fashion-driven to a large extent, and fashion is a dumb reason to pick a tool. It still works very well for the sorts of things TEI tries to do. When there’s a better alternative, I think the TEI should adopt it. HTML is not (yet) a better alternative.

What I’m seeing in your argument is a desire to impose order on an ecosystem from the top down. There is always this tension between the need to standardize and the need to customize—by the latter I don’t mean necessarily to alter the specification itself, but to choose to mark certain features of a text and not others. If I understand your arguments, you feel TEI provides too much flexibility and would be better expressed in a format that is more general and has less expressive scope, but is easy to work with, particularly from a web-publishing perspective. I don’t think that’s an unreasonable opinion, but I hope you’ll forgive me if I don’t share it.

All the best,
Hugh

/**
 *  Hugh A. Cayless, Ph.D
 *  hugh.cayless at duke.edu
 *  Duke Collaboratory for Classics Computing (DC3)
 *  http://blogs.library.duke.edu/dcthree/
**/


--[3]------------------------------------------------------------------------
        Date: Fri, 24 Oct 2014 09:38:47 -0500
        From: Karin Dalziel <promo at nirak.net>
        Subject: Re:  28.393 PostGreSQL and Solr for digital archives
        In-Reply-To: <20141013052556.64A04656E at digitalhumanities.org>


Ed,

Sorry to pick up on an old thread, but your response really intrigued me:

> PS. On the subject of NoSQL, one thing that you might want to consider is
> leap-frogging over traditional client/server web frameworks (Django,
> Rails, etc) and creating a REST web service on top of Solr, which is then
> used by a JavaScript web framework (Bootstrap, Angular, Ember, etc). This
> would allow you to simply use Solr, and not use an RDBMS like MySQL or
> PostgreSQL. The advantage here is that you won’t have to keep PostgreSQL
> and Solr synchronized. Also, your API could be used by mobile apps, and
> third parties. The disadvantage is that you will understand and constrain
> the logical model of your data less. It might be worth asking if your IT
> shop supports ElasticSearch in addition to Solr, since it offers a better
> API, and was built to scale a bit better than Solr was.

I'm confused as to why you can't just drop the database out of the
django/mysql/solr stack and keep everything else the same - couldn't you
just use, say, Rails or Django with SOLR for both the API and interface
and skip the database component if the files are indexed correctly to
support your requests? (I realize Django or Rails might be overkill if
you're skipping the ORM, but they're still useful in many other respects.)

Isn't SOLR itself a REST web service? What would be the advantage to
building an API on top of what SOLR provides? (Or using Elasticsearch if
you preferred that API?)

What would you build the REST web interface with, if not a traditional
client/server framework like Django or Rails?

I'm a little curious about the suggestion of AngularJS and Ember, because I
always thought of those as application frameworks for single page
interactive apps (like gmail) which would be overkill for a site that was
primarily a straightforward presentation of data/content, especially when
one wants to provide static URLs and very accessible, SEO-friendly content
by default. I'm seeing lots of people suggest them as general use
frameworks for all kinds of sites, though, so I'd love to hear the
arguments for a more general use.
----------------------------------------------------------
Karin Dalziel
Center for Digital Research in the Humanities, University of
Nebraska-Lincoln



--[4]------------------------------------------------------------------------
        Date: Fri, 24 Oct 2014 11:16:49 -0500
        From: Ed Summers <ehs at pobox.com>
        Subject: Re:  28.393 PostGreSQL and Solr for digital archives


Hi Karin, 

Thanks for the questions, they are good ones. In hindsight I probably should have left that PS off my email … since it wasn’t terribly relevant to the discussion.

> I'm confused as to why you can't just drop the database out of the django/mysql/solr stack and keep everything else the same - couldn't you just use, say, Rails or Django with SOLR for both the API and interface and skip the database component if the files are indexed correctly to support your requests? (I realize Django or Rails might be overkill if you're skipping the ORM, but they're still useful in many other respects.)

You certainly could do that. However, there are lighter weight Web frameworks for situations where you aren’t using an RDBMS and are serving up JSON instead of HTML. Django and Rails both come with Object-Relational-Mapping tools that often want to at least connect to a database, even if you’re not using one. They also come with lots of machinery for HTML templates, which isn’t really necessary when you are making JSON available.

> Isn't SOLR itself a REST web service? What would be the advantage to building an API on top of what SOLR provides? (Or using Elasticsearch if you preferred that API?)

Yes, if you are using a traditional server side web framework (Django, Rails, etc) you could certainly just use Solr’s (or ElasticSearch's) HTTP API directly. If you are using a client side JavaScript framework you could have it talk back to Solr/ElasticSearch, but that means opening up access to Solr/ElasticSearch to the world. You have to be careful that you don’t allow people to modify/delete stuff, which can be tricky. And sometimes you don’t want to expose everything in the database (user data, etc.).

If you create your own REST service in front of Solr/ElasticSearch it lets you put some thought into what resources you want to expose, and how you want to make data available (URL patterns, etc), independent of the backing database. In theory it would allow you to change your backend database without having to change your web application much. More importantly other applications that might be using your API would not have to change when you changed the database in some way. A separate REST API also gives you a place to manage keys, quotas and authentication if you end up wanting to make the API available to other people.

> What would you build the REST web interface with, if not a traditional client/server framework like Django or Rails?

There are lots of options, depending on what your language preference is. I tend to use Python a fair bit, and have enjoyed using Flask on a few projects. I’ve also found Node’s Express framework pretty handy for web services that need to hold lots of HTTP connections open for streaming. If you google ‘microframework’ plus your preferred programming language, you will probably find some stuff.
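
To give a flavour of it, here is a minimal sketch of such a thin,
read-only proxy in Node's Express (the Solr core name and URL pattern
are invented; Node 18+ is assumed so that fetch is built in):

  const express = require('express');
  const app = express();

  // 'poems' is a hypothetical Solr core; point this at your own index.
  const SOLR = 'http://localhost:8983/solr/poems/select';

  app.get('/api/search', async (req, res) => {
    // Forward only a read-only query; Solr's update and admin
    // endpoints are never exposed to the outside world.
    const q = req.query.q || '*:*';
    const solrRes = await fetch(SOLR + '?q=' + encodeURIComponent(q) +
                                '&rows=20&wt=json');
    const data = await solrRes.json();
    // Reshape Solr's response into the API you actually want to commit to.
    res.json(data.response.docs);
  });

  app.listen(3000);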

> I'm a little curious about the suggestion of AngularJS and Ember, because I always thought of those as application frameworks for single page interactive apps (like gmail) which would be overkill for a site that was primarily a straightforward presentation of data/content, especially when one wants to provide static URLs and very accessible, SEO-friendly content by default. I'm seeing lots of people suggest them as general use frameworks for all kinds of sites, though, so I'd love to hear the arguments for a more general use.

Yes, that’s a fair criticism. They are oriented around web “applications” rather than web “sites”, and unless precautions are taken your content can be invisible to Googlebot, which kind of defeats the purpose of putting the content on the Web in the first place. But Google are increasingly executing JavaScript on web pages [1]. So if you have a sitemap that points to the views, in theory their coverage should be getting better.

I think the main reasons why JavaScript web frameworks are increasingly popular these days are that a) they tend to be more interactive/responsive and b) they require you to create a REST API, which can be used by mobile applications and third parties.

//Ed

[1] http://googlewebmastercentral.blogspot.com/2014/05/understanding-web-pages-better.html



