[Humanist] 26.581 XML & scholarship

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Fri Dec 14 07:02:25 CET 2012


                 Humanist Discussion Group, Vol. 26, No. 581.
            Department of Digital Humanities, King's College London
                              www.dhhumanist.org/
                Submit to: humanist at lists.digitalhumanities.org

  [1]   From:    Desmond Schmidt <desmond.allan.schmidt at gmail.com>        (281)
        Subject: Re:  26.577 Folger Digital Texts --> XML & scholarship

  [2]   From:    Wendell Piez <wapiez at wendellpiez.com>                     (67)
        Subject: Re:  26.577 Folger Digital Texts --> XML & scholarship

  [3]   From:    Doug Reside <dougreside at gmail.com>                        (72)
        Subject: Re:  26.577 Folger Digital Texts --> XML & scholarship


--[1]------------------------------------------------------------------------
        Date: Thu, 13 Dec 2012 22:07:42 +1000
        From: Desmond Schmidt <desmond.allan.schmidt at gmail.com>
        Subject: Re:  26.577 Folger Digital Texts --> XML & scholarship
        In-Reply-To: <20121213083651.DB8792DA3 at digitalhumanities.org>

Elena,

I actually wasn't advocating a plain text format, only plain text XML
versus binary formats including binary XML. But I find it surprising
that you argue digital humanists use XML because it meets their
scholarly needs and not because it has lots of tools. Scholarly needs
cannot be determined by describing how one individual personally uses
XML to edit texts. That can only be determined by talking to textual
scholars (whether or not they use TEI, and many of them don't) and other
classes of user including students, interested members of the public,
researchers in other areas and teachers who use texts as an online
teaching aid. And those other people are valid users because they are
the consumers of what you produce. XML doesn't meet their needs, only
software can do that. I am likewise unconvinced by Wendell's argument
that tinkering with XML or something like it is a user requirement for
digital humanists. Maybe it's what some of them do, but it's not what
they really want to do. Users don't intrinsically want to use any
particular technology; they just want to get their work done well with
the least amount of effort. Patrick actually agrees with me on this:

>> Why humanists continue to struggle with "raw" XML as though it is
>> meaningful for the scholarly enterprise as "XML," I cannot say. What is
>> important is capturing their analysis of a text.

The important distinction is thus between the choice of technology
you make to get a task done and the task itself.

Desmond Schmidt
eResearch Lab
University of Queensland

On Thu, Dec 13, 2012 at 6:36 PM, Humanist Discussion Group
<willard.mccarty at mccarty.org.uk> wrote:
>
>                  Humanist Discussion Group, Vol. 26, No. 577.
>             Department of Digital Humanities, King's College London
>                               www.dhhumanist.org/
>                 Submit to: humanist at lists.digitalhumanities.org
>
>   [1]   From:    "Pierazzo, Elena" <elena.pierazzo at kcl.ac.uk>              (21)
>         Subject: XML and scholarship (was: Folger Digital Texts)
>
>   [2]   From:    Wendell Piez <wapiez at wendellpiez.com>                    (180)
>         Subject: Re: [Humanist] 26.571 Folger Digital Texts
>
>
> --[1]------------------------------------------------------------------------
>         Date: Wed, 12 Dec 2012 11:17:48 +0000
>         From: "Pierazzo, Elena" <elena.pierazzo at kcl.ac.uk>
>         Subject: XML and scholarship (was: Folger Digital Texts)
>         In-Reply-To: <20121212071710.9557A311F at digitalhumanities.org>
>
>
> Dear All,
>
> I have been reading this thread with increasing irritation as I think it leaves out some crucial points and it shows quite a few misconceptions.
>
> It seems that we are increasingly debating whether or not we like XML and whether we prefer plain text. I think this is not really the point. Not many people actually like XML, and I'm one of those who don't. I confess I do not feel any pang of love when I see an angle bracket. However, I think XML is a very useful tool, as it allows me and others to achieve our scholarly goals better than any other tool, and the role of XML for scholarship, and in particular textual scholarship, is the part I think is being left out of this discussion.
>
> I was trained as a textual scholar in a very traditional setting, where not even the shade of an angle bracket was in sight. During that time I was growing more and more uncomfortable with the normal practice of silently intervening in the text, "normalising" all sorts of features of our heritage texts. XML allowed me and many others like me to embed in the text the documentation of our editorial practice at a level of granularity that no other system was -- and is -- able to match. Furthermore, the use of XML according to the TEI Guidelines allowed me and many others to debate our scholarly practice and share our successes and difficulties with a large and growing international community. I have become a much better scholar thanks to the use of XML and the TEI. So, when the Folger Library made available their XML text they acted following scholarly best practice: to expose their editorial work in a way that other scholars can appreciate and evaluate. Plain text has the big disadvantage of hiding under a smooth surface all sorts of editorial intervention, so it is actually false that plain text does not contain an interpretative level: it does, but in a way that is not recoverable, and in a non-scholarly way. Unless we are talking about very recent texts, spelling, punctuation, orthographic habits, hyphenation, and capitalisation are all silently introduced by editors. For a Renaissance play we are talking about around 3,000 silent editorial interventions, as I discovered myself when editing the work of an Italian playwright a few years ago [1]. I think that for Shakespeare we are talking about the same order of magnitude. And I'm not even starting on emendations.
>
> It is not true that we have adopted XML because there are a lot of tools and it is an easy solution: we were using SGML when no tools were available apart from the one we were developing ourselves. We were using it because it met the needs of our scholarly practice.
>
> Again, if someone does not like to take advantage of the rich XML markup, it is actually quite easy to write a script to strip the markup out; in the case of the Folger Texts they have used TEI, which is a widely known standard, and that should make it easier to know how to remove the markup. I think reading the TEI Guidelines and thereby making sense of the markup is a small price to pay for having scholarly edited texts that follow good scholarly practice.
>
> I think I can speak for a large part of the community in saying that we will be ready to change technology the moment we are given the opportunity to do our editorial work in a better and more scholarly way. We know very well the severe limits of XML, but we shall not forget its strengths.
>
> Yours
> Elena
>
> [1] I presented these figures at DH2006 in Paris: 'Just different layers? Stylesheets and digital edition methodology'.
>
> --
> Dr Elena Pierazzo
> Lecturer in Digital Humanities
> Department of Digital Humanities
> King's College London
> 26-29 Drury Lane
> London WC2B 5RL
>
> Phone: 0207-848-1949
> Fax: 0207-848-2980
> elena.pierazzo at kcl.ac.uk<mailto:elena.pierazzo at kcl.ac.uk>
> www.kcl.ac.uk/ddh

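Elena's remark above, that it is quite easy to write a script to strip
the markup out, can be sketched in a few lines of Python. The fragment
below is an illustrative TEI-like stand-in, not the Folger Library's
actual markup:

```python
import xml.etree.ElementTree as ET

# An illustrative TEI-like fragment -- a stand-in, not the Folger
# Library's actual markup.
tei = """<sp who="SEBASTIAN">
  <speaker>SEBASTIAN</speaker>
  <l>Ha, ha!</l>
  <l>What things are these, my Lord Antonio?</l>
</sp>"""

# itertext() walks the parsed tree and yields only the character
# data, discarding every element name and attribute along the way;
# the join/split pass just normalises the leftover whitespace.
plain = " ".join("".join(ET.fromstring(tei).itertext()).split())
print(plain)
```

Of course, as Elena notes, what such a script throws away is exactly
the documented editorial work; the point is only that getting plain
text back out is the easy direction.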

--[2]------------------------------------------------------------------------
        Date: Thu, 13 Dec 2012 11:59:04 -0500
        From: Wendell Piez <wapiez at wendellpiez.com>
        Subject: Re:  26.577 Folger Digital Texts --> XML & scholarship
        In-Reply-To: <20121213083651.DB8792DA3 at digitalhumanities.org>

Dear Elena,

I hope it puts you more at ease if we make clear that we are not
discussing the merits of XML in general versus plain text (sans markup
or syntax) in general.

Rather, we're speculating freely on the nature and affordances of a
non-XML format -- representing more than a sequence of characters,
just as XML represents its elements and attributes along with its raw
data -- that does not exist (at least not outside the lab). And we're
doing so in view of weaknesses in XML that I understand you to freely
acknowledge.

Desmond says this hypothetical format may as well be a binary, given
how opaque XML becomes in application, while I disagree. Patrick says
it doesn't really matter, since it's really all about interfaces. I
agree with that, except for one reservation, namely that a plain-text
serialization -- such as XML gives of its data model(s), which can be
represented and stored in other ways besides angle brackets -- is such
an interface, and a valuable one (maybe indispensable), mainly because
it's so accessible (in tools all the way down to plain-text editors).

Whether Desmond and I can come to agree, or decide that our debate is
moot, depends largely on what the specific features and affordances of
this format are, and in what ways "tinkering" (getting your hands
dirty) remains a necessary activity and for whom (researcher in the
humanities, developer, or whomever). Desmond believes that a capable
and robust format should need no tinkering, or at any rate none that
would require access to a text-based serialization. I'm actually
willing to
agree with this in principle (especially if you define "capable and
robust" as requiring no tinkering so deep down :-) -- just as it
becomes increasingly possible to work with HTML or even XML (at least
for some purposes) without having to see angle brackets. Yet I am
skeptical of his hypothesis that the tinkering we do to put the text
to our own uses can be cleanly separated from the tinkering that
hinders interoperability (even if I agree that this may be worse in
XML than it has to be). In view of this, I also think that coupling
the data model to a plain-text serialization -- I'd like specifically
to see a plain-text *markup syntax* (albeit not XML) -- should be
enabling and useful, contributing to the viability of something we're
all agreed (I think) would be a non-proprietary technology. Whether
this would come at too high a cost (of features or capabilities in the
data model, as I think Desmond might argue) I can't actually say --
again, we're arguing about unicorns (or maybe unicorns and basilisks).

As Patrick says, it's all about interfaces. Desmond says "this makes
my eyes cross: how can you make me do this?" Patrick says "your eyes
shouldn't have to cross: we still need better tools." I say "even if
your eyes cross, it's actually helpful if someone can get access to
the data structures in this form -- partly so they can build better
tools." You say "I don't love angle brackets, but I've looked at them
a while and my eyes aren't crossing so badly now. And the information
in them is essential for me to do my work." I think we are all
correct. In particular, I think you and I agree it's worth learning to
use the tools we have so we can build better ones.

Keep in mind also that there's background to this. All three of us
(Desmond, Patrick and myself) turn up in bibliographies of research
into data structures and formats that would be more capable than XML,
not less, of gracefully describing and working with the kinds of
complex structures in which we all are interested. Here, for example,
is a dramatic fragment in one such non-XML markup syntax, where a
range opens with [name} and closes with {name], and ranges may
overlap rather than nest:

[sp [speaker}SEBASTIAN{]}[line}Ha, ha!{line]
[line}What things are these, my Lord Antonio?{line]
[line}Will money buy ’em?{sp]
[sp [speaker}ANTONIO{]}Very like. One of them{line]
[line}Is a plain fish and no doubt marketable.{line]{sp]

Cheers,
Wendell

--
Wendell Piez | http://www.wendellpiez.com
XML | XSLT | electronic publishing
Eat Your Vegetables
_____oo_________o_o___ooooo____ooooooo_^



--[3]------------------------------------------------------------------------
        Date: Thu, 13 Dec 2012 22:26:00 -0500
        From: Doug Reside <dougreside at gmail.com>
        Subject: Re:  26.577 Folger Digital Texts --> XML & scholarship
        In-Reply-To: <20121213083651.DB8792DA3 at digitalhumanities.org>

> So, when the Folger Library made available their XML text they acted following scholarly best practice: to expose their editorial work in a way that other scholars can appreciate and evaluate.

Those who have spent time with me at DH conferences have probably been
forced to listen to me rail, with varying degrees of tactlessness and
un-examined ignorance, against XML/TEI frequently enough that I can
almost hear the delete buttons being pounded as folks from the near
future see my name in this message header.

And it will come as little surprise to anyone that I remain suspicious
of a hierarchical, syntax-heavy data format like XML for modeling most
texts, which are, to my mind, more like a stream than a tree.

But Elena's point is a good one, one made frequently by TEI/XML
defenders, but one which I hadn't adequately considered until just
now.

We spend much of our time debating the process of *encoding* TEI, but
when I consider TEI as the response from, say, an API call, I begin to
see it from a different...angle [sorry...].

Assume I have a text that I've edited using a fancy new tool...forget
about how it works, let's just assume it's done. Assume the marked-up
text is now stored in a database.  Again, forget what kind.  It's just
a database.

Now assume I want to dump all of the information about a text out in a
form that can be easily used by developers I will never meet.  How
should I hand off the data?  What should that form be?  The standard
methods are XML and JSON.  JSON is cleaner, more compact, and a little
easier to parse using modern languages. It's probably the better
option for an interchange format.  But in the larger scheme of things,
it's relatively new, and there aren't a lot of good JSON schema
validators out there if I want to ensure that the data that came out
of the database is the sort of thing I expected.  But there's already
a lot of work that's been done in XML.  So maybe I just go with XML.
Or maybe I'll provide exports to both.  I'm not sure it matters.  JSON
and XML are pretty much functionally equivalent, and I could see good
reasons for using either.
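The interchange question above can be made concrete with a small
sketch using Python's standard library; the element and key names here
are invented for illustration, not taken from any real Folger export:

```python
import json
import xml.etree.ElementTree as ET

# The same tiny speech exported two ways (names are illustrative).
xml_export = """<sp who="SEBASTIAN">
  <line>Ha, ha!</line>
  <line>Will money buy 'em?</line>
</sp>"""

json_export = """{"sp": {"who": "SEBASTIAN",
                         "lines": ["Ha, ha!", "Will money buy 'em?"]}}"""

# Either export parses back to the same information.
sp = ET.fromstring(xml_export)
xml_who = sp.get("who")
xml_lines = [line.text for line in sp.findall("line")]

data = json.loads(json_export)["sp"]
assert xml_who == data["who"]
assert xml_lines == data["lines"]
```

A developer I never meet can recover the speaker and the lines from
either form, which is the sense in which the two are functionally
equivalent as dump formats.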

Now, as the API designer for this database, I need to decide whether
to push all the text out in a big blob and stick the metadata in a
different data block with offset pointers, or whether I should embed
the tags in the text.  As an API designer I should probably allow
either.  As a consumer of this API, though, I'd probably more often
request the embedded markup.  If I don't know much about the database
structure, embedded tags will make it a bit easier for me to have the
data near the metadata so I can figure out exactly how I'm supposed to
process the response.
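The two shapes described above, embedded tags versus a text blob with
offset pointers, can be sketched as follows. This is a toy
illustration assuming a single non-overlapping annotation, not a
worked-out standoff scheme:

```python
# Embedded markup: the tag travels inside the text stream.
embedded = "<line>Will money buy 'em?</line>"

# Standoff: a plain-text blob plus a separate metadata block whose
# offsets point back into it.
text = "Will money buy 'em?"
standoff = [{"tag": "line", "start": 0, "end": len(text)}]

def apply_standoff(text, annotations):
    """Reconstruct embedded tags from standoff offsets.

    Inserting from the rightmost annotation first keeps the earlier
    offsets valid as the string grows (assumes non-overlapping spans).
    """
    out = text
    for a in sorted(annotations, key=lambda a: a["start"], reverse=True):
        out = (out[:a["start"]] + "<%s>" % a["tag"] +
               out[a["start"]:a["end"]] + "</%s>" % a["tag"] +
               out[a["end"]:])
    return out

assert apply_standoff(text, standoff) == embedded
```

Going the other way, from embedded tags to offsets, is what a parser
does for you, which is part of why the embedded form is the easier one
to consume blind.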

It's all pretty subjective of course, and ideally the software we build
supports lots of options...but if I only get to choose one canonical
interchange format, then maybe embedded tags aren't so bad.

Of course, this doesn't answer the question of how the encoding
happens.  Given what I just wrote, I think it makes the most sense if
XML isn't used at all for the actual data entry...

But then I think about all of the attempts I and others have made to
create "easy to use" XML editors that end up being less functional and
harder to use than a simple text editor.  Anyone with a modicum of web
design experience who has tried to edit HTML in WordPress or Drupal
usually starts hunting for the "edit source" button immediately.  It
feels like there SHOULD be a better kind of data entry tool for
text-encoding than an angle bracket editor, but I'm not yet sure what
it is.

So, after all (I say on this road to Damascus), maybe TEI/XML isn't
_fundamentally_ bad.  There is a lot to fix for sure, but as a
mechanism for getting entire texts into a system and pushing them out
again, it's probably pretty similar to most other solutions I would
design, and probably a lot better than many.  Most of the time when
querying a text I probably just want RDF triples, or even an HTML
document, but for those times when I want to "dump data" or "import
document", TEI/XML is a pretty good solution.

TEI is sometimes treated a bit like organic milk: it's supposed to be
more virtuous than other options, but it can also be little more than
a sort of meaningless shibboleth for entry into a community.  People and
government agencies can sniff at your text archives because they
aren't TEI conformant, but there is very little regulation to ensure
that there is any standardization across documents with the TEI label.
Of course, that doesn't mean the very notion of organic milk, or TEI,
is useless, I suppose...

Doug




