[Humanist] 26.565 Folger Digital Texts

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Tue Dec 11 07:46:01 CET 2012

                 Humanist Discussion Group, Vol. 26, No. 565.
            Department of Digital Humanities, King's College London
                Submit to: humanist at lists.digitalhumanities.org

        Date: Mon, 10 Dec 2012 13:30:24 -0500
        From: Wendell Piez <wapiez at wendellpiez.com>
        Subject: Re: [Humanist] 26.557 Folger Digital Texts
        In-Reply-To: <20121208160112.652563A10 at digitalhumanities.org>

Dear Desmond, and HUMANIST,

On Sat, Dec 8, 2012 at 11:01 AM, you wrote:
> I did download some of the texts. They appear to be marked up for
> linguistic analysis. I don't wish to criticise the Folger texts per
> se, but they do lead me to reflect in general on what the digital
> humanities have become. Is our Shakespeare (and everything else)
> really preserved for future generations in forms like this, or is it
> not now mostly a collection of angle-brackets? One of the advantages
> of XML has always been its supposed human readability, but the gradual
> increase in complexity over the years has now reached a point where
> the plain text format is self-defeating. When even a single line of a
> play has to be stitched together by virtually joining individually
> marked-up words how can we any longer pretend that XML is readable by
> humans? We might as well use a standard binary format.

It's a bit startling, but refreshing, to see this question asked. Yet
I think the answer is not hard to find if we look around us.

There have been several efforts in the open -- and an uncountable
number behind closed doors -- to specify a binary format for XML.
Advantages for such a format would be compactness and efficiency for
certain operations. The W3C has published one such format, EXI
(Efficient XML Interchange), as a Recommendation (the closest thing it
gets to a "standard").

But none of these has really taken off. Why? Probably because the
problems with text-based XML that a binary XML mitigates don't actually
hurt badly enough (and/or are already dealt with adequately by less
extreme means, such as data compression formats like zip) to offset the
costs of locking into tools (and perhaps other dependencies) for
handling the binary.
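The compression point is easy to demonstrate: XML's verbosity is highly
repetitive, and general-purpose compressors exploit exactly that. A rough
illustration with toy data (not the Folger files) and Python's standard
zlib module:

```python
import zlib

# Verbose, repetitive markup: 1000 identical word elements.
xml_doc = ("<speech>" + "<w>word</w>" * 1000 + "</speech>").encode("utf-8")

packed = zlib.compress(xml_doc)
ratio = len(packed) / len(xml_doc)
print(f"{len(xml_doc)} bytes -> {len(packed)} bytes ({ratio:.1%})")
```

Real texts compress less dramatically than this contrived case, but the
tag overhead that makes XML "bloated" is precisely what costs least once
compressed.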

There are niches in which this isn't the case, but of course this
merely demonstrates the point: they are niches. In fact, one of the
advantages of publishing in XML is that your data is ready-made for
one (and all) of these systems. Every day, resources are devoted to
parsing XML (to say nothing of text-based cousins such as HTML) and
compiling it into such formats, optimized for processing, indexing and
searching. When done correctly, this yields a net savings, and one has
the advantages of both worlds.

In other words, this isn't really a binary dichotomy (forgive me), a
choice between a perfectly transparent text-based format on the one
hand, or something perfectly opaque (which may as well be binary) on
the other.

It was entirely predictable, indeed inevitable, that if XML succeeded
as a text-based data format (irrespective of whether it was
subsequently recast as a serialization format for an abstract data
structure capable of being represented in other forms, which arguably
has happened), we would see applications of XML that were not
human-readable, at least in the sense that you could open the source
files in a text editor and understand and analyze them with no help
from any tools or documentation.

Nevertheless, XML developers -- who know about such opacity better
than anyone, dealing with it at first hand -- can be heard crying in
protest every time someone suggests that they should dump the
plain-text format and go to something binary. We want to be able to
use our tools. And there are plenty of them -- because they are not so
hard to build and test, there is a robust commodity market for them,
more robust than the markets for tools to handle even standard (or
quasi-standard) binaries such as PDF or JPEG. As was known from the
start, basing XML on plain text helps solve the chicken-and-egg problem
of having no access to the data at all without the tools, but no way to
build tools without access to the data.

Now of course I am aware that one need not take even XML for granted
(nor do I, as Desmond knows). Yet the availability and affordability
of means and methods for handling the format is at least as important
as the format itself. Indeed, the latter argument is as often used
against XML (in favor of HTML or -- gasp -- binaries such as MS Word
format) as in favor of it.

So comparing what we have to something that doesn't exist is hardly
fair. Indeed, one response to Desmond's question would be to ask which
standard binary format (for representing XML data structures or
something else) he'd recommend. Then each project or publisher could
have the debate on the merits -- as they already do.

Note in particular that the Folger Library hasn't just published its
texts in XML. It is also publishing more than one styled version,
ready for use. And it is encouraging us to do the same (at least
non-commercially). It is all this activity, not the XML by itself,
that helps preserve Shakespeare for future generations.


Wendell Piez | http://www.wendellpiez.com
XML | XSLT | electronic publishing
Eat Your Vegetables
