[Humanist] 26.599 XML & scholarship

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Tue Dec 18 07:49:51 CET 2012


                 Humanist Discussion Group, Vol. 26, No. 599.
            Department of Digital Humanities, King's College London
                              www.dhhumanist.org/
                Submit to: humanist at lists.digitalhumanities.org

  [1]   From:    Patrick Durusau <patrick at durusau.net>                    (126)
        Subject: Re:  26.596 XML & scholarship

  [2]   From:    Wendell Piez <wapiez at wendellpiez.com>                     (82)
        Subject: Re: [Humanist] 26.596 XML & scholarship


--[1]------------------------------------------------------------------------
        Date: Mon, 17 Dec 2012 10:12:23 -0500
        From: Patrick Durusau <patrick at durusau.net>
        Subject: Re:  26.596 XML & scholarship
        In-Reply-To: <20121217063336.B39FF39E2 at digitalhumanities.org>


Desmond,

On 12/17/2012 01:33 AM, Humanist Discussion Group wrote:
>
> Patrick,
>
> Before going any further we should distinguish between the creation of
> digital documents for online use and the deliberate assignment of an
> "explicit semantics" to analog documents that never had any
> syntactical structure (in the computational sense) when they were
> created. I think a lot of your objections can be traced to the
> confounding of these two separate roles for markup.

Thanks for that comment.

I don't think of markup as having separate roles for "born digital" and 
analog documents, but I would not have thought to say so save for your 
comment.

Consider that our email exchange is a series of "born digital" 
documents. It could easily, if more slowly, have been a series of 
hard-copy letter exchanges.

Assuming we agree that explicit semantics, such as resolving entity 
references, could be useful, how does using markup differ between the 
"born digital" and analog documents?

> I should also clarify that I'm not talking about a language like
> Wendell's embedded LMNL, but of a model in which markup ranges are
> held externally to the text in sets. Since there is no syntax the sets
> can be freely mixed. I can have one set for recording links to a set
> of images, another to hold basic textual structure, another for a
> reference system, another for annotations etc. And I can mix these
> sets freely and augment them with my own because I have no fear of
> overlap that is built into the design. You can't do any of this in
> XML. You can only have ONE set at a time: the ever-increasing
> complexity of what I want to record must all go into one file,
> conforming to ONE syntax, along with the text that is obscured by it.

Whether markup is standoff or embedded doesn't affect the attribution of 
explicit semantics to a text. Any number of linguistic annotation 
projects use forms of standoff markup.

The "one set" problem of XML is an artificial constraint inherited from 
SGML, mostly due to insufficient understanding of parsing practices 
current when SGML was written.

Still, it is a fair point that the constraint does exist, and if you 
want multiple sets of markup, all attributing explicit semantics to a 
text, and use standoff markup to do so, that is a valid choice.
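
For readers who have not seen standoff markup, a minimal sketch of the 
idea as I understand it (my own illustration, not Desmond's system or 
any published format): several independent sets of ranges over one 
plain text, two of which overlap in a way a single well-formed XML tree 
could not express without workarounds.

    # My own minimal sketch, not any particular standoff format: a plain
    # Unicode text plus independent sets of ranges. Sets may overlap
    # freely; a single well-formed XML tree would have to pick ONE
    # hierarchy (lines or sentences) and fragment the other.

    text = "I sing of arms and the man. He came from Troy to Italy."

    lines     = [(0, 35, "line"), (36, 55, "line")]
    sentences = [(0, 27, "s"), (28, 55, "s")]     # 2nd sentence crosses the line break
    names     = [(41, 45, "placeName"), (49, 54, "placeName")]

    # Mixing sets is just putting them side by side; no syntax to collide.
    for start, end, label in sorted(lines + sentences + names):
        print(f"{label:10s} {start:2d}-{end:2d} {text[start:end]!r}")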

But I continue to point out the need for explicit semantics; a bit more 
on that below.

>> Implied semantics are *lossy* recording of semantics because there can
>> be no accumulation of analysis on top of implied semantics nor any
>> reliable interchange of the underlying artifact.
> I'm not talking about building anything "on top of implied semantics"
> but on top of text. Semantic markup already performs this role
> successfully, and standoff properties are based on the same basic idea.
> The underlying artefact is a Unicode text file. Why can't you
> interchange that? As for the separate markup sets why are they any
> more or less interchangeable than XML?

Then can you clarify, from the documentation at the site (which I think 
Humanist readers should review), what explicit semantics are carried in 
your sets?

Are the semantics of all of the TEI semantic markup available?

I ask because when I reviewed the documentation, that did not appear to 
be the case.

It isn't simply a matter of an interchangeable encoding, such as 
Unicode, that makes a text "interchangeable" in the TEI sense.

It is the provision of explicit semantics that allows me to recover, 
extend, use, or disagree with whatever semantics you have attributed to 
a text. I think that is closer to the sense of "interchange" that the 
TEI intends.
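
As a toy illustration of the difference (invented labels, not a real 
TEI serialization): the bare Unicode string carries none of the 
interpretation, while explicit annotations let a second reader recover, 
extend, or disagree with mine without guessing.

    # Hypothetical sketch, not a real TEI serialization: the string alone
    # carries no interpretation; explicit annotations can be exchanged,
    # compared, and contested.

    text = "De bello Gallico"

    annotations_scholar_a = [
        {"start": 0, "end": 16, "claim": "title", "resp": "A"},
    ]
    annotations_scholar_b = [
        {"start": 0, "end": 16, "claim": "foreign (Latin), not a title", "resp": "B"},
    ]

    # Both interpretations travel with the same text and can be compared.
    for ann in annotations_scholar_a + annotations_scholar_b:
        span = text[ann["start"]:ann["end"]]
        print(f"{ann['resp']}: {span!r} -> {ann['claim']}")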

>> * We should be mindful that "simple and works" is a poor basis for
>> format/program design. The original presumption of well-formed XML was
>> made in deference to programmers who could write an XML parser in a
>> weekend.
> Simplicity is of course the basis of all good design. I admit that this
> was the premise of XML *originally*, but since then the W3C has piled
> documentation and complexity on top of XML that made it anything but
> simple. You really should read what Tim Bray said about the "bloated,
> opaque and insane complexity" of XML Web services already in 2004, or
> more recently James Clark (2010) on why XML has got out of hand and is
> now bad at solving the problems it was designed for. And these guys
> created XML if anyone did. The OOXML specification you mention is over
> 6,000 pages long.
>
> http://blog.jclark.com/2010/11/xml-vs-web_24.html
> http://www.tbray.org/ongoing/When/200x/2004/09/18/WS-Oppo

On OOXML being 6,000 pages long, you may want to read my "The 6,000+ 
Page Myth" at: http://www.durusau.net/publications/6000pagemyth.pdf

I re-edited the Word Processing part of OOXML from 1,780 pages down to 
452 pages (a reduction of approximately 74%). As the paper puts it: 
"Compare that with an edited version that changes the line spacing, font 
size, removes duplicate text, and reformats the listing of references."

Even greater savings were possible but I did not want to change any of 
the substantive text.

>> While I recognize the shortcomings of XML, the loss of explicit
>> semantics, by whatever means, is a cure worse than the disease.
>>
> There is no technological need for "explicit semantics" in cultural
> heritage texts. It is alien to them. All that matters, as you point
> out at the start of your post is that we must answer the needs of the
> user. The chosen technology only has to facilitate that. But XML
> actually gets in the way.

Ah, so our disagreement isn't so much about XML as it is about the 
requirements for analysis of texts.

At least #2 on my list of requirements for a system for the analysis of 
texts would be the explicit preservation of the semantics of the text as 
I interpret it and of the analysis as I assign it, in a form that can be 
reliably interchanged with others.

That is, others don't have to guess at what I may have meant by a 
change in fonts, italics or not, bold or not, divisions in the text, 
alignments or their absence, etc.
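
A toy example of the point (invented field names, not any specific 
encoding): recording only that a span was italic leaves every later 
reader to guess among the possible readings, while an explicit 
interpretation travels with the text and can be accepted, extended, or 
disputed.

    # Illustration only, with invented field names, not a specific encoding.
    italic_span = "Moby-Dick"

    # With only presentation recorded, the interpretation is lost:
    record_presentational = {"text": italic_span, "rendition": "italic"}
    possible_readings = ["title of a work", "emphasis", "foreign word", "ship's name"]

    # With explicit semantics, my reading travels with the text:
    record_semantic = {"text": italic_span, "rendition": "italic",
                       "interpretation": "title of a work", "resp": "PD"}

    print("later readers must guess among:", possible_readings)
    print("explicit record:", record_semantic)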

To lack explicit semantics for textual analysis means that scholarship 
returns to being an episodic enterprise: every generation starts over, 
guessing at what the prior generation may have meant and laying the 
groundwork for its heirs to guess at theirs.

I stand by the requirement to meet the needs of users, but users need 
tools that assist them in stating their analysis of a text, for future 
generations to agree, disagree or extend.

XML doesn't have to get in the way of that process, unless you make a 
fetish out of users typing XML markup.

Hope you are having a great week!

Patrick

-- 
Patrick Durusau
patrick at durusau.net
Technical Advisory Board, OASIS (TAB)
Former Chair, V1 - US TAG to JTC 1/SC 34
Convener, JTC 1/SC 34/WG 3 (Topic Maps)
Editor, OpenDocument Format TC (OASIS), Project Editor ISO/IEC 26300
Co-Editor, ISO/IEC 13250-1, 13250-5 (Topic Maps)

Another Word For It (blog): http://tm.durusau.net
Homepage: http://www.durusau.net
Twitter: patrickDurusau



--[2]------------------------------------------------------------------------
        Date: Mon, 17 Dec 2012 11:46:13 -0500
        From: Wendell Piez <wapiez at wendellpiez.com>
        Subject: Re: [Humanist] 26.596 XML & scholarship
        In-Reply-To: <20121217063336.B39FF39E2 at digitalhumanities.org>

Patrick, Desmond and HUMANIST,

It's a temptation to say a great deal more but I'll limit myself today
to simply qualifying, and perhaps complicating, a couple of Desmond's
most recent points.

He says:
> I should also clarify that I'm not talking about a language like
> Wendell's embedded LMNL, but of a model in which markup ranges are
> held externally to the text in sets. Since there is no syntax the sets
> can be freely mixed. I can have one set for recording links to a set
> of images, another to hold basic textual structure, another for a
> reference system, another for annotations etc. And I can mix these
> sets freely and augment them with my own because I have no fear of
> overlap that is built into the design. You can't do any of this in
> XML. You can only have ONE set at a time: the ever-increasing
> complexity of what I want to record must all go into one file,
> conforming to ONE syntax, along with the text that is obscured by it.

There is a bit of an oversimplification of the matter with LMNL here.
It's true as far as it goes about the proposed LMNL syntax ("sawtooth"
or "sabertooth" syntax, as it's been called), which is indeed an
embedded markup syntax. (Except the point about LMNL syntax not
allowing free intermixing of sets of ranges over a text. It does.
Desmond, who disparages embedded markup altogether, will argue that it
won't be practicable or pretty, but that's a different debate.)

However, LMNL itself is a model, and the proposed syntax is only one
way of representing it. Indeed, any of the other proposed or more or
less familiar ways of representing ranges over text, including standoff
markup and out-of-line annotations, can be mapped into the LMNL model,
which is capable of supporting the same kind of radical concurrency
that Desmond describes here.
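
Since "model, not syntax" is easier to say than to picture, here is a 
very loose sketch of the idea (my own toy, emphatically not the LMNL 
specification): the model is just text plus concurrent ranges, and 
different surface representations, standoff or inline, are simply 
different ways of loading that same model.

    # A loose approximation of the point, not the LMNL specification:
    # the model is "text plus ranges"; different surface representations
    # are just different ways of loading the same model.

    from dataclasses import dataclass

    @dataclass
    class Range:
        name: str
        start: int
        end: int
        annotations: dict

    def from_standoff(text, rows):
        """Standoff form: (name, start, end) tuples kept outside the text."""
        return text, [Range(n, s, e, {}) for (n, s, e) in rows]

    def from_spans(parts):
        """Inline-ish form: (run_of_text, names_open_over_that_run) pairs."""
        text, ranges, offset = "", {}, 0
        for run, names in parts:
            for n in names:
                ranges.setdefault(n, [offset, offset])
                ranges[n][1] = offset + len(run)
            text += run
            offset += len(run)
        return text, [Range(n, s, e, {}) for n, (s, e) in ranges.items()]

    # Two representations, one model: both yield text plus concurrent ranges.
    print(from_standoff("brown fox", [("np", 0, 9), ("adj", 0, 5)]))
    print(from_spans([("brown", ["np", "adj"]), (" fox", ["np"])]))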

I apologize to any readers who wish to know more, since I don't feel
this is the best venue to report on my progress working with what
remains an *experiment* in markup and its applications.

Desmond continues:
> There is no technological need for "explicit semantics" in cultural
> heritage texts. It is alien to them.

I don't think the question here regards cultural heritage texts as
such but rather their digital surrogates and representations.

Yet in agreement (I think) with Desmond, it's important to keep in
mind that the aspiration of many scholars and initiatives has been to
produce encoded texts that are specifically *not* locked into
application semantics of any kind, including even display semantics.
While this has not been the dominant trend everywhere, the idea of an
application-neutral and independent encoding is still strong in
initiatives like the TEI -- i.e. an encoding scheme whose semantic
bindings are loose and can be (more or less) freely reconfigured for
and in application.

Who was it that just quoted Bateson, defining a "bit" as "a difference
that makes a difference"? Another way of putting this is that the
semantics of an encoding scheme like TEI are its own, and are properly
meaningful only within the semantic context of TEI itself; moreover,
within that context the need to specify an alignment of TEI
descriptive semantics with presentational, procedural or any
application semantics is regarded as a system feature, not a bug.

Of course this is part of what bothers some people about it (Doug?),
and Patrick surely has a point that for most users, having an encoded
text as such isn't enough. We need the encoding to present at least
enough of an application binding (to something in our system that
"means" something and "does" something) to be able to engage the gears
(presumably, to display the text without obfuscating syntax and work
with it in the ways we want).
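
To make "application binding" a little more concrete, a minimal sketch 
(invented names, not a TEI stylesheet or any real pipeline): descriptive 
categories on one side, one of many possible presentational mappings on 
the other. A different application could bind the same categories quite 
differently.

    # Minimal sketch with invented names, not a TEI stylesheet: one of
    # many possible bindings from descriptive categories to display.

    binding_for_display = {
        "title":    lambda s: f"<i>{s}</i>",
        "persName": lambda s: s,              # this application ignores names
        "quote":    lambda s: f"\u201c{s}\u201d",
    }

    def render(span_text, category):
        # Fall back to plain text when the binding says nothing about a category.
        return binding_for_display.get(category, lambda s: s)(span_text)

    print(render("Moby-Dick", "title"))
    print(render("Call me Ishmael.", "quote"))
    print(render("Queequeg", "persName"))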

I guess I'm just old-fashioned in thinking that both are possible --
even while all the ways we want to work with the text may be
innumerable. The encoded text can be workable in applications, and yet
hackable nonetheless. (Interesting too that while Rome wasn't built in
a day, if you actually look at the history of Rome's building you will
see a series of incomplete efforts to pave over what went before.
Similarly, I doubt that anyone's effort to define one single format To
Rule Them All is going to succeed.) And while I happen to agree with
Desmond that range models are very promising for all kinds of research
applications (including some for which XML is not well-suited), I
don't agree that a range model per se is going to solve this problem.
On the contrary, it might make it even worse.

Or even better, if you regard the "problem" of the specification,
implementation and communication of the semantics of our encoding not
as a problem at all, but as a set of opportunities. :-)

Cheers,
Wendell

--
Wendell Piez | http://www.wendellpiez.com
XML | XSLT | electronic publishing
Eat Your Vegetables
_____oo_________o_o___ooooo____ooooooo_^




