[Humanist] 26.648 XML &c

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Fri Jan 4 09:55:37 CET 2013


                 Humanist Discussion Group, Vol. 26, No. 648.
            Department of Digital Humanities, King's College London
                              www.dhhumanist.org/
                Submit to: humanist at lists.digitalhumanities.org

  [1]   From:    Desmond Schmidt <desmond.allan.schmidt at gmail.com>         (25)
        Subject: Re:  26.644 XML &c

  [2]   From:    Jay Savage <jsavage at fordham.edu>                         (142)
        Subject: Re:  26.627 XML & what kind of scholarship

  [3]   From:    Wendell Piez <wapiez at wendellpiez.com>                     (63)
        Subject: Re: [Humanist] 26.644 XML &c

  [4]   From:    drwender at aol.com                                          (46)
        Subject: Re:  26.644 XML &c


--[1]------------------------------------------------------------------------
        Date: Thu, 3 Jan 2013 21:45:13 +1000
        From: Desmond Schmidt <desmond.allan.schmidt at gmail.com>
        Subject: Re:  26.644 XML &c
        In-Reply-To: <20130103082131.55089DEB at digitalhumanities.org>

James,

> What you refer to are differences of interpretation of the italicized text itself,
> a fact which is common to both print and digital media

The interpretation is not a "fact" of the printed text. That just has
black marks on a page, whereas the digital text
<emph>really</emph> has an explicit interpretation encoded in digital
bytes, which states: "the word 'really' is emphatic". Where is that
information in the printed text, and not just in your head as you read
it?

Herbert,

> If the manuscript were lost as in so many other cases, the so-called facts would be the confusing interpretations by contemporary typesetters.

If you recall I said that any change of medium for a text involves
interpretation, and that's what you are also saying about the Büchner
texts. In going from manuscript to print the typesetters had to
interpret what they saw just as an XML transcriber has to. And if the
manuscript were then lost, all we would have would be the printed texts,
which would then be our only "facts". And then you'd have to
*conjecture* that they were typesetters' misinterpretations rather
than what the author wanted. This reminds me of the Byzantine
manuscripts of Aeschylus, which are full of interpolations, but they
are still facts because in many cases they are all we have. Facts
don't have to be true, they can just be things that exist.

Desmond Schmidt
eResearch Lab
University of Queensland



--[2]------------------------------------------------------------------------
        Date: Thu, 3 Jan 2013 11:49:53 -0500
        From: Jay Savage <jsavage at fordham.edu>
        Subject: Re:  26.627 XML & what kind of scholarship
        In-Reply-To: <20121228085606.52148F99 at digitalhumanities.org>


Hi Desmond,

Quite the opposite, actually: I am arguing that markup is never neutral.
Whether an originary author/editor/publisher/what-have-you embeds markup
10 seconds after a word is first typed or someone comes along 400 years
later makes no difference. Markup is always an act of interpretation, and
just as suspect for a contemporary text as for an ancient one.

Let me ask three questions (and a number of subquestions). First, why
should we care what the author of an electronic text does or does not do or
intend? In the case of Shakespeare, you are quite comfortable reviewing and
cataloging the physical features as they exist and leaving it at that. Why,
confronted with an electronic text, make recourse to questions of authorial
intent? Shakespeare's manuscripts (or whatever Hemings and Condell had
access to) were marked in ways that led his first readers--his editors--to
interpret the text in certain ways that influenced their reading and
subsequent typesetting. Perhaps certain words were underlined or
capitalized; we will never know for certain, of course, but we can guess
based on extant 17th-century manuscripts. Other decisions were based on the
styles of the day and the printer's house style. All of those decisions
resulted in a complex bibliographic code designed to communicate
information to readers. Your own text, likewise, has markup embedded that
you hope will influence readers' interpretation based on their
understanding of the conventions and technology of XML parsing. In both
Shakespeare's case and your own, whether the intended information is
successfully communicated to the reader is quite outside the author's (or
editor's) control. Why should we privilege the contemporary author's
intentionality more than we do Shakespeare's?

Aside from the obvious difference in the technique used (tagging vs.
underlining), how is our contemporary markup truly different from
Shakespeare's? Even if we do privilege intentionality, is your intention
truly more accessible to readers, and less open to interpretation? Do your
readers really understand "<emph>" means "emphasis" and not just "the
shortcut I use because I know that my XML will eventually be converted to
HTML and most web browsers render '<emph>' as italic"? Personally, I
generally assume the latter. There are certainly conventions for markup,
but they are hardly universal.
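
For illustration, a minimal, hypothetical XSLT sketch of the flattening I
have in mind (the element names and the italic target are assumptions of
the sketch, not anyone's actual workflow):

  <?xml version="1.0" encoding="UTF-8"?>
  <!-- Whatever the tags are supposed to "mean", the only observable
       outcome of this stylesheet is the same HTML <i>. -->
  <xsl:stylesheet version="1.0"
                  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="emph | foreign | title | hi">
      <i><xsl:apply-templates/></i>
    </xsl:template>
  </xsl:stylesheet>

Once a template like that sits in the pipeline, "emphasis", "foreign" and
"title" are operationally indistinguishable to the reader of the output.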

Second, staying focused on markup, can we really assume that contemporary
authors/editors/publishers control the markup in the texts they produce? I
would suggest not. Most electronic texts are marked up programmatically in
ways that are only loosely under human control. An author producing a
manuscript in Microsoft Word never sees the lines of markup inserted
by the "i" button on the toolbar. A majority of users, in fact,
would probably be surprised to learn that the "x" in ".docx" even stands
for "XML." Most bloggers have no idea what systems like Wordpress do to
their formatting behind the scenes. Those that do are normally horrified by
what they discover. Even the most careful TEI practitioners are
occasionally caught off guard by the idiosyncrasies and "features" of
<oXygen/>, SAX, etc.
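
To make that concrete, here is roughly (and much simplified) the
WordprocessingML that the "i" button writes into the document.xml part of
a .docx package; the author never sees it, and the flag it sets records
presentation, not intent:

  <w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
    <w:r>
      <w:rPr>
        <w:i/>  <!-- "italic on": a formatting switch, not a reason for it -->
      </w:rPr>
      <w:t>really</w:t>
    </w:r>
  </w:p>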

Finally, how is "analog" truly a useful category of text? Are
Ojibwe wiigwaasabakoon really more akin to the Illustrated London News than
11 fascicles of the original New English Dictionary are to the 20
volumes of the Second Edition (OED), which was composed in SGML with IBM's
LEXX software and electronically composited before being printed to paper?
Again, I would argue not. To me, drawing an arbitrary analog/digital
distinction in this way deliberately and problematically elides--for
rhetorical purposes--very important distinctions between a wide variety of
entirely dissimilar manual and mechanical reproductive technologies, not to
mention the distinctions between very different "digital" technologies. It
also ignores the obvious similarities and continuities between print and
post-print reproduction, and obscures the last fifty years of development
in printing technology, where digital devices and methods gradually came to
dominate every aspect of the trade. Is a typescript from a Xerox
Memorywriter an analog text, or a digital one? If it is analog, how long
does text have to be resident in memory before it becomes digital? What of
the OED2? At the same time, can an archived RFC 822 email from 1982 be
usefully compared to a contemporary XML-TEI edition simply because they
are both "digital"? I would suggest, instead, that we move in the direction
of what Katherine Hayles terms Media Specific Analysis that treats myriad
reproductive technologies, whether colloquially "analog" or "digital," on
their own terms.

I may not have been clear in my initial post, but I certainly don't think
it is easy to move from digital to print and back again. I think it is very
difficult, and that precisely the information you identify is lost. I don't
think it is easy to move from manuscript to print and back again, though,
either. Nor do I think it is easy to move from flexography to hand press and
back, or from photostat to rotogravure and back. Even moving from
manuscript to manuscript we see issues of scribal error and the difficulty
of producing faithful diplomatic transcriptions. All transcription and
translation are fraught. It's the arbitrary distinction between "analog"
and "digital" that I don't find compelling, precisely because it implies
some innate similarity between the various "analog" technologies and
between the various "digital" ones. Composing lines of type in reverse on
a stick and creating etched plates from original watercolors seem to me to
be at least as transformational as encoding a text into XML or scanning an
image to JPEG. At the same time, converting an HTML text to PDF
seems just as transformational as copying verse from a bible into a commonplace
book. All remediation is destructive, even between "analog" media and
between "digital" media. We need to not loose sight of that.

(I would, though, argue that there *are* certain shared features common to
all electronic artifacts at the level of the physical and computed
substrates that do not affect this particular discussion of human-readable
text and human-interpretable, human-edited markup, and that we should
really rethink what we mean by "etext." But that is a conversation for a
different day.)

In any case, I am deeply uncomfortable with the way that "markup" and
"digital" seem to be taken as synonyms in this conversation. They are two
very different concepts. Even if we accept the existence of an
analog/digital divide, there are many ways to create "digital" texts. Very
few of them include structured markup. Even those that do rely on other,
more fundamental computed processes. Distinguishing between the subject and
object of interpretation with electronic texts is a battle that is lost
before it is even begun. The markup itself is already encoded in UTF-8 or
another scheme. There is simply one CTO which we attempt to parse logically
into "text" and "markup." Doing that successfully, though, requires a
priori knowledge of either the markup or the "plaintext," or preferably
both.

Likewise, physical texts have been physically marked up for millennia. A
great deal of our contemporary XML and HTML markup is, in fact, simply
aimed at reproducing typographical conventions. I think one of the real
roots of the issue is that the creators of SGML and its derivatives, among
others, have assumed that bibliographic features not only have interpretive
value, but have consistent, predictable interpretive value. This is
demonstrably false, and attempting to assign Berners-Lee-type "semantic"
value to markup has been problematic.

Physical texts were never "facts." They are complex, messy containers for
bibliographic codes that are subject to interpretation. They exist in
variants and editions that must be accounted for if their stories are to be
told fully. We can't expect electronic texts to be entirely different
and semantically stable. I don't personally find that cause for alarm. It
simply recasts and expands familiar questions of text and work, peritext
and paratext.

Nor do I think that the subjective nature of textual criticism is a reason
to despair of computational tractability or open exchange. LMNL is one
potentially interesting solution to overlapping ranges. TEI provides
invaluable contributions in other areas. There are many others, each with
their own strengths and weaknesses. The trick, I think, will be to give up
on the TEI ideal of One Spec To Rule Them All and put effort instead into
developing expertise in translating between specific cases as needed. We do
this routinely with many kinds of data, to great profit.

We just need to be clear on the true nature of the tenuous
relationships (or lack thereof) among the very different concepts of
"text," "encoding," "markup," and "meaning."

Best,

--j

----------------------------------------
Jay Savage, Ph.D.
Director of Academic Information Technology Services
Yeshiva University ITS

jsavage1 at yu.edu
(646) 592-4092

"You can't just ask customers what they want and then try to give that to
them. By the time you get it built, they'll want something new." --Steve
Jobs



--[3]------------------------------------------------------------------------
        Date: Thu, 3 Jan 2013 13:30:09 -0500
        From: Wendell Piez <wapiez at wendellpiez.com>
        Subject: Re: [Humanist] 26.644 XML &c
        In-Reply-To: <20130103082131.55089DEB at digitalhumanities.org>

Jim R, Willard, and HUMANIST:

On Thu, Jan 3, 2013 at 3:21 AM, Humanist Discussion Group
<willard.mccarty at mccarty.org.uk> sent:
> Thanks much for the response, but if the output is italics in all cases,
> how is the encoding interpretation in this case?  What you refer to are
> differences of interpretation of the italicized text itself, a fact which
> is common to both print and digital media.  Italics, obviously, can be used
> for foreign words, emphasis, stage directions, titles, etc., but this
> variety of uses pre-existed digital media and is only reflected by it again
> now.  Once the decision to italicize has been made, what difference does it
> make what code you write to produce italic text?

This is actually a nice example illustrating an important principle of
markup language design and application. To markup old-timers, it's
embedded so deeply that we forget it. But as time passes and context
changes it's important to revisit the issue.

Let's say for purposes of argument that on a far-off planet somewhere
they use an encoding scheme called "TEI". (It stands for "Transcribing
Evolving Information". This civilization is oriented towards
processes, verbs and ambiguity, not things, nouns and fixity, and they
like gerunds in their acronyms.) It offers the following element types
(in an encoding scheme close enough to XML for our purposes), among
others:

emph - inline content that is emphasized rhetorically (indicated
generally by a typographical shift such as italics appearing in the
midst of roman)

foreign - inline content receiving a similar typographical distinction
in order to indicate the foreign-language origin of a word or phrase

title - the title of a literary or creative work appearing in line,
possibly distinguished typographically

hi - inline content that is typographically distinct, for any reason
or for none discernible or worth communicating
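
A quick sketch, with invented content, of how a single sentence might be
tagged on that planet:

  <p>She had <emph>never</emph> read <title>Hamlet</title>, she said,
     though she dropped <foreign>Weltschmerz</foreign> into conversation
     freely, and one word in her letter was <hi>slanted</hi> for no
     reason I can name.</p>

Four different judgements, all of which a printer would most likely
realize identically, in italics.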

Jim's question suggests that even if this profusion of possibilities
is not problematic (insofar as the categories are weakly specified,
arguable, and overlapping), it is unnecessary, because it all comes
out in italics in the end, and it's properly up to readers, not the
system, to attribute any semantics to any italics (as to any
typography) they see.

On a practical level, this is fair enough, but only as long as
rendering in italics is the only operational effect of this tagging.
As soon as we want to distinguish between these cases -- for example,
for indexing, text analysis, to color some (not all) of them pink, or
for any other purpose -- then the distinctions become meaningful.
"Interpretation" (such as the choice between 'emph' and 'foreign'),
with all its hazards, becomes "data" (input to the system).
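
For example (a toy sketch, not a recommendation; the pink is as arbitrary
as the rest of the example): the moment a stylesheet like the one below
exists, the choice between 'emph' and 'foreign' stops being a private
judgement and becomes input the system acts on.

  <?xml version="1.0" encoding="UTF-8"?>
  <xsl:stylesheet version="1.0"
                  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <!-- rhetorical emphasis: plain italics -->
    <xsl:template match="emph">
      <i><xsl:apply-templates/></i>
    </xsl:template>
    <!-- foreign words and phrases: still italic, but coloured pink -->
    <xsl:template match="foreign">
      <i style="color: pink"><xsl:apply-templates/></i>
    </xsl:template>
    <!-- titles and unclassified highlighting: italics and nothing more -->
    <xsl:template match="title | hi">
      <i><xsl:apply-templates/></i>
    </xsl:template>
  </xsl:stylesheet>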

If you are crying "tautology" at this, you are right. Assuming you
have no such need right now, the question is when or whether that
moment ever comes. A related question is whether it is less onerous to
prepare for the event now, or wait until it happens to deal with it
(or, better, leave it to someone else to deal with, if they care to). Sometimes the need is
foreseeable and the labor is comparatively light, but there is always
a limit to what you can do.

This design problem is especially challenging when using a system that
must be wired up in advance. XML is more flexible in this respect than
many data processing systems. But because all the elements in an XML
document must fit neatly together, it is less flexible than we might
sometimes like.

Cheers,
Wendell

--
Wendell Piez | http://www.wendellpiez.com
XML | XSLT | electronic publishing
Eat Your Vegetables
_____oo_________o_o___ooooo____ooooooo_^



--[4]------------------------------------------------------------------------
        Date: Thu, 3 Jan 2013 17:04:31 -0500 (EST)
        From: drwender at aol.com
        Subject: Re:  26.644 XML &c
        In-Reply-To: <20130103082131.55089DEB at digitalhumanities.org>


 Dear James, you wrote:

   > The context I had in mind was digital reproduction of originally printed
   > text, though, with the person doing the encoding having to decide how to
   > encode italic text.

In this (I suppose: scholarly) context the encoding enthusiast can say:
"While the traditional book editor remains in the opaqueness of
italicization, leaving interpretation to the reader, the encoding scholar
makes explicit what s/he supposes to be given implicitly in the
typographical 'fact'." ('Befund': italics - 'Deutung': emphasis /
foreign-language item / ...)

In terms of speech act theory: in old-fashioned book editions the edited
text tends to remain on the level of locutionary acts, reproducing the
'original' (in my discipline: literary) message; this holds also for some
digital surrogates, e.g. the CD packaged with the famous facsimile edition
of Kafka's "Der Process", where facsimiles and transcript are shipped as
PDFs produced by a desktop publishing workflow. XML/TEI-fashioned digital
editions, on the other hand, tend to ask about the illocutionary force
too: what has the author / typesetter / editor done by italicizing a span
of text?

In the light of Desmond's critical view we can now ask:
(1) how appropriate are those (embedded or stand-off) descriptions of illocutionary acts?
or (2) how reasonable are they, taken in their own right, as perlocutionary acts in scholar-to-scholar communication?
or (3) - for me the crucial question - how are we to judge the costs of blind tagging (without knowledge of the processes that will later handle all the pretty XML files)?

I'm wondering about the future of this _very_ pleasant thread.
With kind regards, Herbert


