[Humanist] 26.615 XML & what kind of scholarship

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Sat Dec 22 10:40:31 CET 2012

                 Humanist Discussion Group, Vol. 26, No. 615.
            Department of Digital Humanities, King's College London
                Submit to: humanist at lists.digitalhumanities.org

  [1]   From:    Wendell Piez <wapiez at wendellpiez.com>                     (94)
        Subject: Re: [Humanist] 26.612 XML & what kind of scholarship

  [2]   From:    Desmond Schmidt <desmond.allan.schmidt at gmail.com>         (55)
        Subject: Re:  26.612 XML & what kind of scholarship

        Date: Fri, 21 Dec 2012 11:17:41 -0500
        From: Wendell Piez <wapiez at wendellpiez.com>
        Subject: Re: [Humanist] 26.612 XML & what kind of scholarship
        In-Reply-To: <20121221090829.3BB193A28 at digitalhumanities.org>

Dear Desmond,

On Fri, Dec 21, 2012 at 4:08 AM, Humanist Discussion Group
<willard.mccarty at mccarty.org.uk> wrote:
> Willard is not mistaken. There are no practical markup languages
> embedded in the text that are not OHCOs, for otherwise they would not
> be computer recognisable languages.

Please understand I write with a great deal of caution and humility,
since I am painfully conscious of how much I don't understand about
all this. Fools rush in, etc.

Yet at the same time, I can't help feeling that you keep telling
me a unicorn isn't possible, while I am looking out my back window and
watching one nibbling flowers in the garden. Working with documents
marked up using LMNL syntax isn't theoretical for me: I'm doing it
every day to whatever extent I can make time and find faith. (I too
have plenty of doubts. They just haven't convinced me that trying this
out isn't worth the effort.)

Now, you tell me what I'm seeing isn't a unicorn, but something else
that simply happens to look and act just like a unicorn.

We aren't actually in disagreement about grammars and parsing. As I
wrote yesterday, LMNL syntax does have a grammar, and parsing it does
yield a tree -- it is a syntax tree, representing the tags in the
text. The difference is in what the parser is asked to do with that
information -- the "thing" (the model) that the tags in the text are
taken to represent, which is instantiated ("built" in memory) and
processed in subsequent operations.

A LMNL processor builds a range model -- a model exactly aligned in
every important respect (that I know of) with every standoff-based
model for attributing properties or annotations to text I have seen so
far, capable of describing arbitrary ranges with arbitrary properties
and annotations. Interestingly, this is done by forgoing what XML
does: inferring hierarchical relations (parent, child, sibling, etc.)
among elements in the model from the sequence of tags. ("Element"
is a thing in the model. "Tag" is a thing in the syntax.) In LMNL (the
model), a range is a range, which may happen to be enclosed by another
range, or it may overlap it. All ranges in the document are peers, and
it is up to an application to build hierarchies out of them if it
wants to.

Consequently it has all the properties you like about range models,
including that it can support concurrent but disjunct descriptions of
the same text; multiple concurrent hierarchies; ranges with the same
name overlapping one another ("arbitrary overlap"); etc.
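
The range model Wendell describes might be sketched as follows. This
is only an illustration, not LMNL's actual data structures: the names
Range, contains and overlaps are my own. The point is that ranges are
peers over a character sequence, overlap is legal, and any hierarchy
is computed by the application rather than stored in the model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Range:
    """A named span over the base text; offsets are character positions."""
    name: str
    start: int  # inclusive
    end: int    # exclusive

def contains(a: Range, b: Range) -> bool:
    """True if a encloses b."""
    return a.start <= b.start and b.end <= a.end

def overlaps(a: Range, b: Range) -> bool:
    """True if a and b overlap without either enclosing the other."""
    return (a.start < b.end and b.start < a.end
            and not (contains(a, b) or contains(b, a)))

text = "Sing, O goddess, the anger of Achilles"
line = Range("line", 0, len(text))
phrase = Range("phrase", 6, 21)   # crosses the sentence boundary
sentence = Range("s", 0, 16)

# Overlap is a legal relation between peer ranges here,
# whereas it cannot be expressed as nesting in a single XML tree.
print(overlaps(phrase, sentence))  # True
print(contains(line, phrase))      # True
```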

> What you are referring to are data
> structures that can be expressed using markup formalisms such as
> linking (i.e. using IDs to connect elements). I can represent a
> complete graph in XML that way but it doesn't mean that the XML
> language in question has such a structure. It's still a tree. The
> links themselves aren't part of the language. You can't write a
> grammatical rule that controls which elements an ID can connect to, or
> that the target must exist or that the links don't form a directed
> cycle etc. Since you can't syntax-check any of that, such files should
> be locked to prevent accidental damage. One way to achieve that is to
> use a binary format.

Indeed, I accept this -- it describes the internals of many XML
processors as well (to say nothing of your web browser), in which the
tree itself is represented using pointers. Yet I am nevertheless
interested that embedded markup -- once you have a strong two-way
mapping between markup syntax and model (as XML has, just barely) --
provides another way to protect this fragile creature (just as XML
doesn't ask you to maintain pointers between an element parent and its
element children).
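
Desmond's point about links can be illustrated concretely. The XML
below is well-formed, and a validating parser with ID/IDREF attributes
could even check that each target exists -- but nothing in the grammar
prevents the references from forming a cycle; only application code
can detect that. (The element and attribute names are made up for the
example.)

```python
import xml.etree.ElementTree as ET

doc = """<notes>
  <note id="a" ref="b"/>
  <note id="b" ref="a"/>
</notes>"""

root = ET.fromstring(doc)  # parsing succeeds: the tree itself is fine
links = {n.get("id"): n.get("ref") for n in root}

def has_cycle(links):
    """Follow each chain of references, watching for repeats."""
    for start in links:
        seen, cur = set(), start
        while cur in links and cur not in seen:
            seen.add(cur)
            cur = links[cur]
        if cur in seen:
            return True
    return False

print(has_cycle(links))  # True -- and the parser never noticed
```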

Internally, it is quite likely that a LMNL processor will maintain a
set of linked objects. But that does not have to be exposed to the user any
more than your binary is. Instead, it can be generated by parsing a
set of tags embedded in text (a markup instance). When it has to be
edited, it can be edited either through a user interface or API (the
way your binary will), or by being serialized as markup, edited and
parsed again (the way XML sometimes is -- when it's not simply
manipulated as a binary).

In other words, the architecture is (as far as I can see) entirely
compatible with yours, with one big exception, namely that LMNL offers
a serialization format that can be manipulated as plain text,
syntax-checked, and parsed. I'm perfectly happy to concede that for
many purposes, an embedded markup representation may not prove to be
practical (when the markup gets "thick") or necessary (when we have
better interfaces than the parsing/serialization cycle). Yet it is
nevertheless so useful for so many things that I am just not in a
hurry to discard it.

Like Patrick, I think examples of markup being used badly say nothing
about markup or the potentials of markup -- some of which are not
realized in some examples of XML, and others of which are impeded by
XML itself. To say that examples of "bad XML" (or examples of XML that
you just happen not to like) demonstrate that embedded markup is
useless is, to me, like saying that because you can get on the
telephone and order a bad pizza to be delivered to your house --
something millions do every day, even though the pizza is bad --
therefore we should never bother with Italian cuisine.

Happy New Long Cycle!


Wendell Piez | http://www.wendellpiez.com
XML | XSLT | electronic publishing
Eat Your Vegetables

        Date: Sat, 22 Dec 2012 07:28:10 +1000
        From: Desmond Schmidt <desmond.allan.schmidt at gmail.com>
        Subject: Re:  26.612 XML & what kind of scholarship
        In-Reply-To: <20121221090829.3BB193A28 at digitalhumanities.org>


I have read this syntax statement carefully, although I have seen it
before. This document describes a tree-structure, like all grammars.
It resembles the "Trojan milestones" described by Steven DeRose at
Extreme Markup 2004, and the "Co-indexing" technique described by
Barnard in 1992. The key difference is that you implicitly connect
start and end milestones when they immediately follow one another,
without resorting to IDs. But any such connections are NOT part of the
language. They are the information content of the language. It is like
saying that "the cookie jar" is the jar with the cookies inside it
when convention states that it is only the glass, lid and shape of the
jar. Convention states that the term (computer recognisable)
"language" is the thing governed by a grammar. The rest of your syntax
specification is in plain English, not grammatical rules, and
describes the information content of the language. Since it is exactly
mappable, as you point out, to XML, it cannot describe anything more
than XML.
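
For readers unfamiliar with the milestone techniques Desmond mentions,
here is a simplified sketch. Empty start/end marker elements sit
legally inside an XML tree, and application code -- not the grammar --
pairs them into ranges. The element names seg-start/seg-end and the n
attribute are invented for the example; DeRose's actual pattern
differs in detail.

```python
import xml.etree.ElementTree as ET

doc = """<div>
  <l>Sing, <seg-start n="1"/>O goddess,</l>
  <l>the anger<seg-end n="1"/> of Achilles</l>
</div>"""

root = ET.fromstring(doc)
starts, ranges = {}, []
for elem in root.iter():  # document order
    if elem.tag == "seg-start":
        starts[elem.get("n")] = elem
    elif elem.tag == "seg-end":
        # Pairing happens here, in application code; no schema
        # can require that every start has a matching end.
        ranges.append((starts.pop(elem.get("n")), elem))

print(len(ranges))  # 1 range, crossing the <l> boundary
```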

I don't think this notation would be usable by digital humanists.
Apart from serious verification problems which I won't go into here,
the modern user is mostly interested in point and click interfaces, in
Web-accessible applications, rather than markup tags. We only have to
work out how to engineer that, and the technology under the hood does not
have to (and in my view should not) be human editable, whatever it is.
The only markup humans can tolerate editing is simple and robust
wiki-type tags. But mostly they prefer to use GUIs for editing.

Since we clearly need more sophisticated data structures than mere
trees, I say that binary formats are better. What is
provably correct when written by a machine can be provably correct
when it is read in by a machine. But letting humans loose on it in the
middle is a recipe for trouble.

Your main point on why embedded markup is better than standoff (third
starred point) is relevant only if we continue to edit markup by hand.
Once we let the machine do this automatically, it no longer matters.
And with embedded markup I still don't see how you can
combine sets of markup as you can with standoff properties.
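
The combination Desmond refers to can be sketched in a few lines:
because standoff annotations store only offsets into a shared base
text, two independently produced sets merge by simple union. The
annotation sets and property names here are invented for illustration.

```python
text = "In the beginning was the Word"

# Two independently produced standoff sets: (start, end, property)
structural = [(0, 29, "sentence")]
linguistic = [(0, 2, "prep"), (25, 29, "noun")]

# Merging is just set union; neither set needs to know about the other,
# and overlapping or enclosing spans coexist without conflict.
combined = sorted(structural + linguistic)
for start, end, prop in combined:
    print(f"{prop}: {text[start:end]!r}")
```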

Desmond Schmidt
eResearch Lab
University of Queensland

On Fri, Dec 21, 2012 at 7:08 PM, Humanist Discussion Group
<willard.mccarty at mccarty.org.uk> wrote:
> Desmond writes further (specifically about the LMNL project):
>> You already know I think the "sawtooth syntax" is not a computer
>> recognisable language because it apparently has no grammar that
>> governs its entire syntax. The equivalence of "has a grammar" and "is
>> computer recognisable" was acknowledged to be already "well known" by
>> Chomsky in 1959. So I don't think the sawtooth syntax can do what you
>> claim. However, I have no significant objection to the LMNL model
>> itself; in fact it is rather clever.
> The syntax has a grammar, here:
> http://lmnl-markup.org/specs/archive/Detailed_LMNL_syntax.xhtml
> What LMNL does not have is a grammar to describe document structures
> (as opposed to a markup syntax).
