[Humanist] 26.609 XML & what kind of scholarship

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Thu Dec 20 10:31:55 CET 2012


                 Humanist Discussion Group, Vol. 26, No. 609.
            Department of Digital Humanities, King's College London
                              www.dhhumanist.org/
                Submit to: humanist at lists.digitalhumanities.org

  [1]   From:    Daniel Allington <daniel.allington at open.ac.uk>            (32)
        Subject: Re:  26.605 XML, TEI and what kind of scholarship?

  [2]   From:    Desmond Schmidt <desmond.allan.schmidt at gmail.com>        (100)
        Subject: Re:  26.599 XML & scholarship

  [3]   From:    Martin Mueller <martinmueller at northwestern.edu>          (153)
        Subject: Re:  26.605 XML, TEI and what kind of scholarship?

  [4]   From:    Wendell Piez <wapiez at wendellpiez.com>                    (146)
        Subject: Re: [Humanist] 26.605 XML, TEI and what kind of scholarship?


--[1]------------------------------------------------------------------------
        Date: Wed, 19 Dec 2012 11:09:22 +0000
        From: Daniel Allington <daniel.allington at open.ac.uk>
        Subject: Re:  26.605 XML, TEI and what kind of scholarship?
        In-Reply-To: <20121219074123.26E34DC1 at digitalhumanities.org>

Willard

I think that there may be an analogy between releasing the XML markup 'behind' an edition and releasing the source code for an application. Unreadable sources might just as well be closed, and conscientious programmers spend a great deal of effort making sure that their code is human readable. Of course, 'human readable' means 'readable by other programmers', not 'readable by any untrained person'. But this discussion was started by Desmond Schmidt, who is most definitely not 'any untrained person'.

As it happens, I disagreed with Desmond until I took it upon myself to look at the actual markup he was referring to - which was an experience akin to opening a Word file in vi. His most salient comment (from my point of view) was 'They [the texts] appear to be marked up for linguistic analysis'. If a programmer looks at source code and can do no more than guess at what might be going on, he or she may quite legitimately question its readability. Shouldn't it be the same when an editor of digital editions looks at somebody else's markup, especially when it's done using an open standard like TEI? And if it can't be the same, isn't it time to question the role of markup? There are many forms of human-unreadable XML - nobody would expect to look at an SVG file and intuit what the picture was, for example - and there may be nothing wrong with the fact that TEI markup is apparently evolving in that direction.

But if that's the case, we need reliable and intuitive ways of getting the information we want out of other people's markup. (Which is absolutely not the same thing as writing a script to turn XML into plain text.) You're right that markup enables editors to record decisions, but so does an apparatus criticus - and an apparatus criticus is nothing if not human readable.

Best wishes

Daniel



--[2]------------------------------------------------------------------------
        Date: Wed, 19 Dec 2012 21:32:06 +1000
        From: Desmond Schmidt <desmond.allan.schmidt at gmail.com>
        Subject: Re:  26.599 XML & scholarship
        In-Reply-To: <20121218064951.3A7A03A33 at digitalhumanities.org>

Patrick,

I'll comment on a selection of your points, and try to be brief:

> how does using markup differ between the
> "born digital" and analog documents?

In "born-digital" markup is part of what I write. It is a fact. In
"born-analog-and-transcribed-to-digital", markup is an interpretation.
It is different every time the "transcription" is redone by someone
new. In born-digital markup is always the same. Although, as you point
out, I may use the same tools in processing both born-digital and
born-analog texts, the kinds of interaction between user and text in
the two cases will differ significantly. For example, in the
born-analog case we often request a facsimile side by side with its
transcription so we can verify its accuracy. In the born digital case
such a prop would be superfluous.

> Whether markup is standoff or embedded doesn't impact the attribution of
> explicit semantics to a text. Any number of linguistic annotation
> projects use forms of stand off markup.
>
It is true that "Standoff markup" has been used in linguistics since
the early 1990s. And simply removing XML tags from a text and later
putting them back doesn't change the status of the markup one iota.
It's still a tree and you can still only have one markup set at a
time. Being able to change one set for another is an advantage, but
having the two stored separately is equally inconvenient, so there is
no overall gain in usability.

But "standoff properties" are different, because they have no real
syntax they can be combined to enrich a text. The advantage is now
decisive: I can add markup sets A, C, and E to a text but not B and D,
and then format it, or I can choose B, C and D etc. and format that.
This is definitely an improvement because it increases flexibility
while providing a way to handle the ever-increasing complexity. This
cannot be done using embedded forms of markup.

> Then can you clarify from the documentation at the site I think
> Humanists readers should review, what explicit semantics are carried in
> your sets?
>
> Are the semantics of all of the TEI semantic markup available?

All TEI and other XML markup is available because it is imported
one-for-one. Elements become ranges and attributes become
"annotations" on the ranges. This feature is taken from the LMNL
*model*. The format itself is trivial. There is a description at
dhtestbed.ctsdh.luc.edu/hritinfrastructure/index.php/stil - not very
good perhaps but all I have at present as we continue to concentrate
on the software development.

> At least #2 on my list of requirements for a system for analysis of text
> would be the explicit preservation of semantics of the text as I
> interpreted it and analysis as I assigned it. In a form that can be
> reliably interchanged with others.
>
> To lack explicit semantics for textual analysis means scholarship
> returns to being an episodic enterprise that starts over with every
> generation guessing what may have been meant by the prior generation and
> laying the groundwork for their heirs to guess at theirs.
>
If by this you mean a standard and interchangeable format for
describing text hermeneutically I am well aware of the ideals long
voiced on the subject. But unfortunately "between the idea and the
reality ... falls the shadow."
What I see in TEI marked-up texts in practice is this: redefinition of
tags that already exist under a different name, new attributes added
willy-nilly when other ones already exist for the purpose,
output-related information embedded into supposedly reusable and
interchangeable texts, misuse of tags for the wrong purposes, and
general ignorance of what it says in the Guidelines because people
simply don't read them.
At the XML level yes we can interchange texts with other XML programs
for parsing and searching but can we interchange or interoperate texts
at the level of subjective markup? I don't think so. And if you don't
believe me read Syd Bauman's excellent piece in Balisage 2011, or
Martin Mueller's open letter to the TEI. They know better than I do
what they are talking about.

http://ariadne.northwestern.edu/mmueller/teiletter.pdf
http://www.balisage.net/Proceedings/vol7/html/Bauman01/BalisageVol7-Bauman01.html

Wendell,

>There is a bit of an oversimplication of the matter with LMNL here.
>It's true as far as it goes about the proposed LMNL syntax ("sawtooth"
>or "sabertooth" syntax, as it's been called), which is indeed an
>embedded markup syntax. (Except the point about LMNL syntax not
>allowing free intermixing of sets of ranges over a text. It does.
>Desmond, who disparages embedded markup altogether, will argue that it
>won't be practicable or pretty, but that's a different debate.)

You already know I think the "sawtooth syntax" is not a computer
recognisable language because it apparently has no grammar that
governs its entire syntax. The equivalence of "has a grammar" and "is
computer recognisable" was acknowledged to be already "well known" by
Chomsky in 1959. So I don't think the sawtooth syntax can do what you
claim. However, I have no significant objection to the LMNL model
itself; in fact it is rather clever.

>And while I happen to agree with
>Desmond that range models are very promising for all kinds of research
>applications (including some for which XML is not well-suited), I
>don't agree that a range model per se is going to solve this problem.
>On the contrary, it might make it even worse.

Ranges may not be the answer to everything but they they neatly
describe textual properties, and that's a large part of the problem.
But I'd agree ranges make markup worse if they are embedded. So just
don't embed them.

Desmond Schmidt
eResearch Lab
University of Queensland



--[3]------------------------------------------------------------------------
        Date: Wed, 19 Dec 2012 16:26:00 +0000
        From: Martin Mueller <martinmueller at northwestern.edu>
        Subject: Re:  26.605 XML, TEI and what kind of scholarship?
        In-Reply-To: <20121219074123.26E34DC1 at digitalhumanities.org>

"Some" may be a useful word to keep in mind in this discussion. Many
literary scholars are unlikely to have much use for the TEI, because they
also have little use for for the idea of text as a computationally
tractable object (CTO). For good and bad reasons, this is unlikely to
change anytime soon. From some ultimate perspective it  may be "laughable
nonsense" to think of a text as an "ordered hierarchy of content objects,"
but for many purposes, this assumption works well enough, and in a
practical world "works well enough often enough" will always trump
existential "is."

For some literary scholars, text as CTO is an attractive working
hypothesis. They use terms like "distant reading," "macro-analysis", or
"scalable reading." They tend to be quite bad at talking in a language
that their skeptical colleagues feel like listening to. Not for them the
wisdom of Bill Clinton's "you must put the corn where the hogs can get at
it." On the other hand, those skeptical colleagues are also not very good
at preaching beyond the choir.

If you think that "text as CTO" is often helpful, the TEI question is
inflected differently. It becomes the question "What value does TEI
encoding add to the computational tractability of texts?" or "What query
potential is created by TEI encoding and how large is the community that
can benefit from such encoding?" The NLP folks, who certainly believe in
text as CTO, tend to answers those questions with "little or none." And
the first thing  they do with an encoded text is to throw away the
encoding so that they can use their routines on the raw text or add their
own annotations. On the other hand, in their excellent _Natural Language
Processing with Python_  Bird, Klein, and Loper introduce Conditional
Frequency Distribution as their first substantial analytical tool.  They
teach you how to compare samples from different "genres" in the Brown
corpus. From that perspective, TEI encoding offers potentially a powerful
tool for enhancing computational tractability: it lets you divide a
digital object into its elements and aggregate those elements across
different texts. Alas, there are still very few tools that let literary
scholars perform those operations. More accurately, there are such tools,
but they typically have a much steeper learning curve than literary
scholars are willing to cope with

"Coarse but consistent" are the guiding words if encoding is to deliver
scholarly benefits that go beyond the ad hoc purposes of a particular
project. If you want to ponder the 'haecceitas" of a single text or a few
texts, you are much better off with a book. The late Philip Stone
somewhere quotes a definition of science as "the systematic throwing away
of evidence." The point of encoding is not to encode everything in a text,
but to mark some features in such a way that a machine can retrieve
different occurrences of the "same" feature across more texts than a human
could possibly read. Dumb, but fast and accurate retrieval of coarse
features across large data sets. Tossing is the cost of keeping in such an
enterprise. There is much wisdom in Desmond Schmidt's recent comment that
"At the moment, XML files in the humanities are proportionally less useful
to others the more markup is embedded in them, because they become a
specific representation of the work of one researcher, which interferes
with the work of another." Is there a sweet spot of baseline encoding
jointly created by scholarly communities for the purpose of supporting
"agile data integration" as an "engine that drives discovery"? I quote
Brian Athey, a professor of medicine at Michigan, who in the same talk
said that "It¹s difficult to incentivize researchers to share data"
(http://blog.orenblog.org/2011/07/19/brian-athey-big-data-2011-rdlmw/).

I don't know whether there is such a sweet spot. The life scientists by
and large believe that there ought to be and work towards it, while
recognizing the many difficulties. Many humanists seem to believe that
there should not be such a sweet spot in the first place. But that may be
their problem. 

Professor emeritus of English and Classics
Northwestern University

On 12/19/12 1:41 AM, "Humanist Discussion Group"
<willard.mccarty at mccarty.org.uk> wrote:

>                 Humanist Discussion Group, Vol. 26, No. 605.
>            Department of Digital Humanities, King's College London
>                              www.dhhumanist.org/
>                Submit to: humanist at lists.digitalhumanities.org
>
>
>
>        Date: Tue, 18 Dec 2012 09:21:46 +0000
>        From: Willard McCarty <willard.mccarty at mccarty.org.uk>
>        Subject: XML, TEI and what kind of scholarship?
>
>
>My King's colleague Elena Pierazzo's message several days ago drew much
>needed attention to the disciplinary perspective from which the question
>of markup is considered. She made the valuable point that systematic
>markup offers the textual editor the ability to record minute decisions
>at the location in the text where they are made. In the job-defining
>role as *editor* an editor must decide about this or that variant, mark of
>punctuation etc, but without markup and the computing which goes with it
>there is no way of recording decisions at the minute level of detail at
>which they are made. With it these decisions can be recorded. (Textual
>editors who know better please contradict.)
>
>Another colleague, whose passion is ancient inscriptions, pointed out to
>me some time ago that markup is similarly well-suited to epigraphy --
>because of what she called the "reporting function" of that discipline.
>The epigrapher witnesses and publishes surviving inscriptional evidence
>while it still exists, before someone defaces it, carts it away and
>sells it on the black market, weather wears it away or whatever. The
>epigrapher provides material for the benefit of other scholars. Markup
>and associated technologies are a godsend.
>
>For the literary scholar, however, interpretation is a different matter,
>requiring a very different disciplinary style and making very different
>demands on the technologies we devise to assist it. My 10 or so years
>devoted to markup (pre-TEI) taught me that it is not in principle
>well-suited to the literary critic's interpretative practices. Jerome
>McGann has made this point forcibly numerous times.
>
>To a publisher text as an "ordered hierarchy of content objects" makes
>perfect sense. To a literary critic it is laughable nonsense. To a
>philosopher it is an interesting hypothesis, I would suppose, whose
>implications need working out. To an historian it is evidence of people
>thinking in a particular way at a particular time, raising the question
>of how they came to think thus.
>
>In the digital humanities we are sometimes overly impressed by the
>portability of our methods and tools. We fail to see that when a method
>successful in one discipline is ported into another the game it is
>intended
>to play is different. The criteria which it must meet and the meaning of
>the
>terms in which scholars think are different. Just as platform-independent
>informational text cannot be known except by means of some platform or
>other
>(the term itself is wrong), computing is meaningless to the scholar unless
>manifested within the basic disciplinary context within which he or she is
>operating. Crossing the boundary of an epistemic culture successfully
>involves a complex blend of learning and teaching in what Peter Galison
>has
>usefully called a "trading zone" -- for which see Michael E. Gorman, ed.,
>Trading Zones and Interactional Expertise: Creating New Kinds of
>Collaboration (MIT Press, 2010).
>
>I think we still have a great deal to learn by studying and honouring
>what 
>scholars in various disciplines do.
>
>Comments?
>
>Yours,
>WM
>--
>Willard McCarty, FRAI / Professor of Humanities Computing & Director of
>the Doctoral Programme, Department of Digital Humanities, King's College
>London; Professor, School of Computing, Engineering and Mathematics,
>University of Western Sydney; Editor, Interdisciplinary Science Reviews
>(www.isr-journal.org); Editor, Humanist
>(www.digitalhumanities.org/humanist/); www.mccarty.org.uk/


--[4]------------------------------------------------------------------------
        Date: Wed, 19 Dec 2012 14:43:25 -0500
        From: Wendell Piez <wapiez at wendellpiez.com>
        Subject: Re: [Humanist] 26.605 XML, TEI and what kind of scholarship?
        In-Reply-To: <20121219074123.26E34DC1 at digitalhumanities.org>

Dear Willard,

I've already said a lot in this thread, much of it I'm afraid too
obscure or high-flown to be of much interest to anyone but
specialists.

But you did ask for comments, and I have two that I think are important.

First: while  I'm sympathetic with what you say as far as it goes, I
don't think it does justice to all the considerable *indirect*
benefits of our present efforts. While markup technologies may not
offer any means of directly addressing all conceivable (or even all
known) requirements of scholarship -- well. What does? Markup
technologies may still provide a technological basis for much activity
that they do not directly support. Being able to print Shakespeare in
editions of tens of thousands surely promotes Shakespeare. Do we fault
the technology of print for not helping students memorize the lines or
act out the plays?

I know this doesn't really speak to your point, but I think it does a
disservice to ourselves to forget it.

Secondly, and much closer to the ground, I also feel that your
criticism is undermined by something you didn't say, namely your
implication that markup perforce imposes an OHCO (orderly hierarchy of
content objects) view over the text.

While this is (mostly) true historically, it is not true necessarily,
and other applications of markup are conceivable. (This was evident to
me, schooled in literary criticism, the first day I saw a generalized
syntax for descriptive text encoding.) Moreover, it is important that
we explore these lest we leave a powerful tool on the bench untried.
(Just because you've always used your knife for fruit doesn't make it
unsuited to cheese.)

It should go without saying that such markup would not be XML, which
does indeed (due to its grammar) impose an OHCO -- meaning a text must
either be reducible to such a hierarchy, or be (entirely) represented
by means of such a hierarchy. (And while it's true that a free-form
hierarchical database can be used to describe just about anything,
there's a big difference between using XML in this way and using it
for "markup".)

Desmond is on record as opposing the use of embedded markup
altogether, for reasons you hint at as well as others. But with
respect to how we might better *model* the text and the information a
text encodes (defining "text" broadly here), he and I agree on a great
deal. The OHCO model is a convenience for some things and no help for
others. By no means is it suited to support everything scholars wish
to do.

But please don't identify markup as such with the OHCO thesis. It
wasn't ever thus, and it doesn't always have to be.

Indeed, if developing computational methods for data descriptions that
do not impose an OHCO is an important part of the project you propose
(respecting, learning from and supporting disciplinary practices to
which our present tools are unsuited), then we really need to
understand which side of the line markup is on -- markup conceived
broadly, and not just as SGML/XML -- and whether there might also be
applications of markup that have been blocked, not enabled, by our
present toolkit. Of course, even markup conceived broadly won't be
useful for everything. But I believe we have been blinded by XML to
what other kinds of markup might be good for.

Yes, I am proposing an avenue for research.

Cheers,
Wendell


-- 
Wendell Piez | http://www.wendellpiez.com
XML | XSLT | electronic publishing
Eat Your Vegetables
_____oo_________o_o___ooooo____ooooooo_^





More information about the Humanist mailing list