[Humanist] 26.596 XML & scholarship

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Mon Dec 17 07:33:36 CET 2012


                 Humanist Discussion Group, Vol. 26, No. 596.
            Department of Digital Humanities, King's College London
                              www.dhhumanist.org/
                Submit to: humanist at lists.digitalhumanities.org



        Date: Mon, 17 Dec 2012 07:09:44 +1000
        From: Desmond Schmidt <desmond.allan.schmidt at gmail.com>
        Subject: Re:  26.594 XML & scholarship
        In-Reply-To: <20121216103255.DE88C39E8 at digitalhumanities.org>


Patrick,

Before going any further we should distinguish between the creation of
digital documents for online use and the deliberate assignment of an
"explicit semantics" to analog documents that never had any
syntactical structure (in the computational sense) when they were
created. I think a lot of your objections can be traced to the
confounding of these two separate roles for markup.

I should also clarify that I'm not talking about a language like
Wendell's embedded LMNL, but about a model in which markup ranges are
held externally to the text in sets. Since there is no syntax, the
sets can be freely mixed. I can have one set for recording links to a
set of images, another to hold basic textual structure, another for a
reference system, another for annotations, etc. And I can mix these
sets freely and augment them with my own, because freedom from overlap
is built into the design. You can't do any of this in XML. You can
only have ONE set at a time: the ever-increasing complexity of what I
want to record must all go into one file, conforming to ONE syntax,
along with the text, which is obscured by it.
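
Here is a minimal sketch in Python of what I mean. The names and data
are invented for illustration; this is not our actual implementation,
only the idea of markup ranges held externally to the text:

    # The artefact: plain Unicode text, with no markup inside it.
    text = "The quick brown fox jumps over the lazy dog."

    # Each markup set is a collection of ranges that point into the
    # text by character offset: (start, end, property).
    structure   = [(0, 44, "paragraph"), (0, 3, "word")]
    annotations = [(10, 19, "note: a brown fox")]
    links       = [(16, 19, "image: fox.jpg")]

    # Sets can be mixed freely, and ranges may overlap within or
    # across sets, because there is no containing syntax to violate.
    combined = structure + annotations + links
    for start, end, prop in sorted(combined):
        print(f"{prop!r} -> {text[start:end]!r}")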

> Implied semantics are *lossy* recording of semantics because there can
> be no accumulation of analysis on top of implied semantics nor any
> reliable interchange of the underlying artifact.

I'm not talking about building anything "on top of implied semantics"
but on top of text. Semantic markup already performs this role
successfully, and standoff properties are based on the same basic
idea. The underlying artefact is a Unicode text file. Why can't you
interchange that? And as for the separate markup sets, why are they
any more or less interchangeable than XML?
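
To make that concrete, here is another hedged sketch: the artefact
travels as a plain UTF-8 file, and each markup set can travel as a
separate file alongside it (the JSON serialization is only an
illustrative choice, not a fixed format):

    import json

    text = "The quick brown fox jumps over the lazy dog."
    annotations = [(10, 19, "note: a brown fox")]

    # The artefact itself: a plain Unicode text file, readable anywhere.
    with open("artefact.txt", "w", encoding="utf-8") as f:
        f.write(text)

    # One markup set, serialized independently of the text it describes.
    with open("annotations.json", "w", encoding="utf-8") as f:
        json.dump([{"start": s, "end": e, "property": p}
                   for s, e, p in annotations], f, indent=2)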

> * We should be mindful that "simple and works" is a poor basis for
> format/program design. The original presumption of well-formed XML was
> made in deference to programmers who could write an XML parser in a
> weekend.

Simplicity is, of course, the basis of all good design. I admit that
this was the premise of XML *originally*, but since then the W3C has
piled documentation and complexity on top of XML that have made it
anything but simple. You really should read what Tim Bray said about
the "bloated, opaque and insane complexity" of XML Web services as
long ago as 2004, or more recently James Clark (2010) on why XML has
got out of hand and is now bad at solving the problems it was designed
for. And these guys created XML if anyone did. The OOXML specification
you mention is over 6,000 pages long.

http://blog.jclark.com/2010/11/xml-vs-web_24.html
http://www.tbray.org/ongoing/When/200x/2004/09/18/WS-Oppo

>
> While I recognize the shortcomings of XML, the loss of explicit
> semantics, by whatever means, is a cure worse than the disease.
>

There is no technological need for "explicit semantics" in cultural
heritage texts. It is alien to them. All that matters, as you point
out at the start of your post, is that we answer the needs of the
user. The chosen technology only has to facilitate that. But XML
actually gets in the way.

Desmond Schmidt
eResearch Lab
University of Queensland

On Sun, Dec 16, 2012 at 8:32 PM, Humanist Discussion Group
<willard.mccarty at mccarty.org.uk> wrote:
>                  Humanist Discussion Group, Vol. 26, No. 594.
>             Department of Digital Humanities, King's College London
>                               www.dhhumanist.org/
>                 Submit to: humanist at lists.digitalhumanities.org
>
>
>
>         Date: Sat, 15 Dec 2012 06:42:04 -0500
>         From: Patrick Durusau <patrick at durusau.net>
>         Subject: Re:  26.586 XML & scholarship
>         In-Reply-To: <20121215105208.2DF9C3A28 at digitalhumanities.org>
>
> Desmond,
>
>>> Date: Thu, 13 Dec 2012 22:26:00 -0500
>>> From: Doug Reside <dougreside at gmail.com>
>>> Subject: Re:  26.577 Folger Digital Texts --> XML & scholarship
>>>> But then I think about all of the attempts I and others have made to
>>>> create "easy to use" XML editors that end up being less functional and
>>>> harder to use than a simple text editor.  Anyone with a modicum of web
>>>> design experience who has tried to edit HTML in WordPress or Drupal
>>>> usually starts hunting for the "edit source" button immediately.  It
>>>> feels like there SHOULD be a better kind of data entry tool for
>>>> text-encoding than an angle bracket editor, but I'm not yet sure what
>>>> it is.
>> Doug,
>>
>> I'm glad that someone else recognises the difficulty of this problem.
>> It seems like it ought to be possible to build a graphical editor for
>> TEI-XML, but with 544 or more tags it's impossible to translate all the
>> structures that humanists want to record and represent them all
>> graphically. Simple textual highlighting works, sure, paragraph
>> structures work, but variants, virtual joins, footnotes, links, etc etc?
>> Since you have to represent many tags as raw XML what happens if the
>> user makes a mistake? You'd have to handle that error right there in
>> your online editor, not when the text is sent to the server. You'd have
>> to provide context-sensitive editing, hundreds of pages of explanations
>> as to what each tag signifies, and explain to the user how to fix each
>> mistake. Not a simple task to program, and certainly not a simple
>> one to use.
>>
>> The user's need for a simple editor cannot be met by XML.
>
> On the contrary, the error is starting from XML rather than the
> interface for the user. An XML instance is an artifact that records
> choices made by the user.
>
> As for the complexity of TEI, consider that some of the attributes in
> OOXML have 200+ different contextual meanings, but bear the same
> attribute name. MS Word seems to handle that.
>
> Another error is assuming that the use of overlapping ranges is somehow
> less complex than XML in terms of representation.
>
> That is to say, whatever structure needed explanation in XML, if you
> are going to represent it with overlapping ranges, doesn't the user
> need the same explanation?
>
> Ah, but no, they most likely don't, because with ranges the semantics
> that are *explicit* in XML can be left *implied*. (Not that they must
> be as I am sure Wendell will be quick to point out. Making semantics
> explicit is part of the "hardness" of XML but it is also part of what
> makes it useful. PDF has implied semantics but I would be loath to
> publish a critical edition using it.)
>
> Implied semantics are *lossy* recording of semantics because there can
> be no accumulation of analysis on top of implied semantics nor any
> reliable interchange of the underlying artifact.
>
>> You
>> have to think beyond it, and I believe a consensus is now emerging in
>> the digital humanities that at least the properties of text (NOT its
>> versions) can be practically represented as overlapping ranges. There
>> are quite a few projects now exploring this line of research: eComma,
>> CATMA, LMNL, our own standoff properties. It's not rocket science. It's
>> very simple, and it works. Check out our website austese.net/tests/.
>> Everything you see here is done without XML, from the server to the
>> visualisations, comparisons, everything. The only things that handle
>> XML are the import tools, of course. So I don't believe that XML is
>> actually needed any more to get our work done.
>
> A very impressive demonstration, which Humanist readers should enjoy.
>
> But the question remains, how are the semantics of the structures
> documented?
>
> I agree that it's "very simple and works", but that isn't my issue.*
>
> My issue is how 10, 20 or 200 years from now I will be able to make
> sense of the encoding and leverage further analysis on top of it. If the
> semantics are implied, ranges or no, I cannot reliably reuse a text.
>
> Hope you are having a great weekend!
>
> Patrick
>
> * We should be mindful that "simple and works" is a poor basis for
> format/program design. The original presumption of well-formed XML was
> made in deference to programmers who could write an XML parser in a
> weekend.
>
> It is "simple and works" but fails to account for structures that we can
> attribute to any text.
>
> While I recognize the shortcomings of XML, the loss of explicit
> semantics, by whatever means, is a cure worse than the disease.
>
> --
> Patrick Durusau
> patrick at durusau.net
> Technical Advisory Board, OASIS (TAB)
> Former Chair, V1 - US TAG to JTC 1/SC 34
> Convener, JTC 1/SC 34/WG 3 (Topic Maps)
> Editor, OpenDocument Format TC (OASIS), Project Editor ISO/IEC 26300
> Co-Editor, ISO/IEC 13250-1, 13250-5 (Topic Maps)
>
> Another Word For It (blog): http://tm.durusau.net
> Homepage: http://www.durusau.net
> Twitter: patrickDurusau