[Humanist] 23.785 inadequacies of markup

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Fri Apr 30 07:45:22 GMT 2010


                 Humanist Discussion Group, Vol. 23, No. 785.
         Centre for Computing in the Humanities, King's College London
                       www.digitalhumanities.org/humanist
                Submit to: humanist at lists.digitalhumanities.org

  [1]   From:    Toma Tasovac <ttasovac at transpoetika.org>                   (9)
        Subject: Re: [Humanist] 23.778 inadequacies of markup

  [2]   From:    Martin Mueller <martin.mueller at mac.com>                  (153)
        Subject: Re: [Humanist] 23.778 inadequacies of markup

  [3]   From:    Desmond Schmidt <desmond.schmidt at qut.edu.au>              (93)
        Subject: RE: [Humanist] 23.778 inadequacies of markup


--[1]------------------------------------------------------------------------
        Date: Thu, 29 Apr 2010 10:54:23 +0200
        From: Toma Tasovac <ttasovac at transpoetika.org>
        Subject: Re: [Humanist] 23.778 inadequacies of markup
        In-Reply-To: <20100429052452.E6E9451C55 at woodward.joyent.us>


Dear Desmond,

I was struck by a very strong statement toward the end of your essay: "Standards should not be followed if they lead to mutilation of the data." On the face of it, nobody could argue against this statement, yet I can't help noticing that the very idea of mutilation presupposes that there is a body to be dismembered. Embedded markup turns the marital bed of data and its interpretation into a crime scene. If your essay were a Hollywood movie, it would star Julia Roberts in a TEI remake of Sleeping with the Enemy.

Now, as somebody who came to text encoding directly from literary theory, I have, for many years now, actually celebrated the above-mentioned marital bed, not as a site of some low-budget horror fantasy but rather as a place of playful interaction between the text and its interpretation. The digital text has made it possible for us to explore and experience in very real, kinetic terms what it means to say that the author is dead. The digital text has also made it possible to illustrate - on a practical, demonstrative level - what it means to say that there is no such thing as a stable, immutable text.

I have always tried to tell my students and analogue humanists that text encoding is an extended hand of literary theory: it allows us, if you will, to sleep in the same bed with our text. And while cultural heritage folks may shudder at the thought of contamination of the original objects they are trying to preserve, the literary theorist in me is exhilarated by the possibility of textual ("objective") and interpretative ("subjective") cohabitation.

You make incredibly important, astute and, above all, timely observations in your essay. Your treatment of the overlap problem, textual variation and multi-version documents is fascinating and necessary. I believe that we need much better software solutions for the problem of the text obscured by highly complex technical data, but I don't see how storing all texts that represent our cultural heritage in a non-markup environment would always make them more accurate. Some texts - historical dictionaries, for instance - would have much to lose and nothing to gain in their plain text versions.

Digitizing textual artifacts already changes their actual bodies beyond recognition. I don't think that we should fetishize these already altered bodies as new, untouchable originals. Messiness does not always inhibit creativity.

All best, Toma

Toma Tasovac 
Center for Digital Humanities (Belgrade, Serbia)
http://humanistika.org
http://transpoetika.org

--[2]------------------------------------------------------------------------
        Date: Thu, 29 Apr 2010 08:15:06 -0500
        From: Martin Mueller <martin.mueller at mac.com>
        Subject: Re: [Humanist] 23.778 inadequacies of markup
        In-Reply-To: <20100429052452.E6E9451C55 at woodward.joyent.us>

Isn't it also a fact that writing is an "industrial tool" or business technology?  There is widespread agreement that the re-introduction of writing into the Greek world around 800 BCE is the result of interaction with Phoenician traders. Some people see the Iliad as having a causative force in the spread of writing. I think it is more likely that the poet(s) of the Iliad used this new technology and that the Iliad, as we now have it, is a different poem for that reason. 

When it comes to 'clerical' technologies, there is not much point in driving a wedge between humanist and business interests. They have always been inextricably interwoven. Were it not for business, a point Christian Wittern makes well, we would not have thousands of clever people all over the world thinking hard about how to lower the entry barriers for human/machine interaction. Not all of that is benign: there is a lot of dumbing down. But then there has always been a lot of dumbing down. 

I found Desmond Schmidt's article very interesting, especially in its excellent review of the history of the TEI and its choice of SGML for implementation.  But I have two disagreements or questions. First, it may be the case that no text is ever a perfect OHCO. Whenever I think about TEI, I remember the line from Wallace Stevens's 'Connoisseurs of Chaos': "The squirming facts exceed the squamous mind".  On the other hand, there are a lot of advantages in treating texts, especially corpora or large numbers of texts that are supposed to be "interoperable", as if they were OHCOs. You lose some on that, but you gain a lot more, and on the trade-off the benefits win. 

Secondly, I must confess that Schmidt lost me when he drew his own model with its very complex way of expressing textual variants. I didn't quite get it. Perhaps I would have got it if I had attended a little more carefully. But I am quite sure that from the perspective of modal humanists the entry barriers for understanding this approach are much higher than the time cost (much less now than a decade ago, and largely because of investments in the business world) of learning how to use oXygen (or similar programs) and do useful stuff with TEI-encoded texts. 
On Apr 29, 2010, at 12:24 AM, Humanist Discussion Group wrote:

> 
>                 Humanist Discussion Group, Vol. 23, No. 778.
>         Centre for Computing in the Humanities, King's College London
>                       www.digitalhumanities.org/humanist
>                Submit to: humanist at lists.digitalhumanities.org
> 
>  [1]   From:    Desmond Schmidt <desmond.schmidt at qut.edu.au>              (48)
>        Subject: RE: [Humanist] 23.776 inadequacies of markup
> 
>  [2]   From:    "Dino Buzzetti" <buzzetti at philo.unibo.it>                 (25)
>        Subject: Re: [Humanist] 23.776 inadequacies of markup
> 
>  [3]   From:    Christian Wittern <cwittern at gmail.com>                    (71)
>        Subject: Re: [Humanist] 23.775 noticing the inadequacies
> 
> 
> --[1]------------------------------------------------------------------------
>        Date: Mon, 26 Apr 2010 20:36:46 +1000
>        From: Desmond Schmidt <desmond.schmidt at qut.edu.au>
>        Subject: RE: [Humanist] 23.776 inadequacies of markup
>        In-Reply-To: <20100426051016.249C8526F8 at woodward.joyent.us>
> 
> I'd like to thank John Walsh for reading my article. I am very grateful for having this public discussion of its contents. But I'd like to respond to his two points, not because I want to refute them (I don't think that is really possible) but because the other side of the argument needs to be stated for those who won't read the whole thing.
> 
> On point 1: It's just a fact - unpleasant or otherwise - that XML is an industrial tool. XML is based on SGML, which was developed at IBM and is much more widely used in industry than by humanists. SGML predated the TEI - the original specification left it open which tool should be used. IBM's SGML was then chosen, having not been developed by humanists at all (to my knowledge). In fact some of SGML's more humanist-friendly features, such as markup minimisation and CONCUR, were left out of XML. I'm not really criticising XML. I use it every day in my work and it is a wonderful engineering tool. What I argued in the paper was that it is unsuited to encoding historical texts in the humanities that never had such codes in them when written.
> 
> On point 2: It's a matter of opinion how significant the embedding of subjective markup codes into the text actually is. In the paper I argued that the thing being interpreted is the text, not the markup. It's not just archiving that is affected. The sharing of texts containing someone else's interpretations biases the research that another person wishes to undertake. It is true that even transcribing a text sans markup is an act of interpretation, but the effect is slight compared to the amount of subjective markup that is then embedded on the basis of that largely academic argument.
> 
> ------------------------------
> Dr Desmond Schmidt
> Information Security Institute
> Faculty of Information Technology
> Queensland University of Technology
> (07)3138-9509
> 
> 
> --[2]------------------------------------------------------------------------
>        Date: Mon, 26 Apr 2010 19:42:44 -0200
>        From: "Dino Buzzetti" <buzzetti at philo.unibo.it>
>        Subject: Re: [Humanist] 23.776 inadequacies of markup
>        In-Reply-To: <20100426051016.249C8526F8 at woodward.joyent.us>
> 
> I, for one, have not found anything "sloppy, careless, and
> thoughtless" in Desmond Schmidt's paper.
> 
> All best,      -dino buzzetti
> 
> -- 
> Dino Buzzetti                     <buzzetti at philo.unibo.it>
> Department of Philosophy
> University of Bologna                 tel.    +39 051 20 98357
> via Zamboni, 38                       fax                98355
> I-40126 Bologna BO           http://antonietta.philo.unibo.it 
> 
> 
> 
> --[3]------------------------------------------------------------------------
>        Date: Tue, 27 Apr 2010 15:28:15 +0900
>        From: Christian Wittern <cwittern at gmail.com>
>        Subject: Re: [Humanist] 23.775 noticing the inadequacies
>        In-Reply-To: <20100425080710.19FCB5303F at woodward.joyent.us>
> 
> This is a reply to the note by Desmond Schmidt on his LLC paper and the 
> following [excerpted] comment by WM:
> 
> On 2010-04-25 17:07, Humanist Discussion Group wrote:
>> 
>> I wonder, here out loud, whether collaborative projects, based on a
>> common understanding of what's going on, don't tend to attenuate
>> creative thinking. I wonder whether standards (so-called or otherwise),
>> which enable a common effort, don't at the same time dampen experiment?
>> Once something that can be routinised is moved from the laboratory to
>> the factory, isn't it time to move on? Or, even more annoyingly perhaps,
>> isn't it time to question our successes?
> 
> As always, I think a cautionary note and a hesitating mindset are 
> appropriate, but on the other hand, exploring the inner 
> regions of this newly discovered continent seems more appropriate 
> than quibbling about the exact width our railway tracks should have.  
> Once we settle on one gauge, we should busy ourselves building the 
> network, connecting the remote locations and enjoying our findings.  At the 
> same time, there might be room for developing high-speed trains to 
> connect some key areas, or other experiments, but such projects would by 
> necessity proceed with a different priority, and probably on a different 
> timescale.
> 
> I agree with Desmond Schmidt (and, as he says, most others who have 
> thought about this) that we are still in the age of digital 
> incunables.  Text encoding is still in its infancy and a *lot* of 
> experimenting is still going on; whole new archipelagos are being discovered, 
> even as, in the areas where we arrived first, some factories have started working.
> 
> Now to take up some points from Desmond's paper, I think it is important 
> to not forget the 'I' in TEI which stands also for 'interchange'.  While 
> the TEI Guidelines are used by many projects I know of as primary 
> formats, there are also many projects that internally use a different 
> format (for a whole range of reasons), but strive to be able to express 
> their results *also* in TEI, in order to be able to exchange data with 
> other projects, but also as archival versions that might be used in 
> later stages of the project.  This enables us to talk with each 
> other, and to observe and name the features in our texts in a way that 
> bridges the individual projects.
> 
> The issues Desmond raises against the way textual variants are encoded 
> in TEI are valid and well taken; this is an area that indeed requires 
> more research and experiments; the MVD list structure is a welcome 
> contribution in that respect.  I do think it should be both possible and 
> worthwhile to come up with a way to encode such graph and list 
> structures in TEI.
> 
> Another area where important concerns are raised is the level of 
> expertise that is required to work on XML-encoded TEI texts by directly 
> editing the source in an XML editor.  This is where the demands of the 
> technology frequently get in the way of its users and obscure rather 
> than illuminate -- we definitely should strive to do better.  However, 
> I am not convinced that the "command line interface" versus "graphical 
> user interface" dichotomy that Desmond constructs here goes to 
> the heart of the matter.  It seems to me that what we have to learn is to 
> build tools that combine a GUI that hides unnecessary details from 
> the users with the power of working with commands, which 
> for example also includes the ability to chain frequently used 
> commands into a new single command.  The Author mode of oXygen is an 
> attempt to do this, as was the similar mode of "hiding the tags" that 
> early tools like Author/Editor provided.  I think that the 
> combination of XML databases, the dynamic interaction with text they 
> enable, and new user interfaces (possibly browser-based, but maybe even, 
> gasp, with Emacs?) has enormous potential here, and I expect to see 
> innovation in this area in the coming years.
> 
> This brings me to another point that Desmond makes in his paper, about 
> the "industrial use" of XML, which he makes sound a bit dirty.  To me, 
> this means that as Digital Humanists we can expand our toolbox and 
> expect to tap into a much larger pool of talent and 
> developers than we could otherwise have available.  A mixed blessing 
> maybe, but I see real potential here to leave the 
> cradle of digital text and enter early childhood - but there is 
> certainly a lot of growing up to expect, and certainly a lot of creative 
> thinking!
> 
> Christian Wittern, Kyoto
> 
> 
> 
> _______________________________________________
> List posts to: humanist at lists.digitalhumanities.org
> List info and archives at: http://digitalhumanities.org/humanist
> Listmember interface at: http://digitalhumanities.org/humanist/Restricted/listmember_interface.php
> Subscribe at: http://www.digitalhumanities.org/humanist/membership_form.php



--[3]------------------------------------------------------------------------
        Date: Fri, 30 Apr 2010 10:04:37 +1000
        From: Desmond Schmidt <desmond.schmidt at qut.edu.au>
        Subject: RE: [Humanist] 23.778 inadequacies of markup
        In-Reply-To: <20100429052452.E6E9451C55 at woodward.joyent.us>

Let me also thank Christian Wittern for reading the paper, which is 20 pages long, and for taking the trouble to respond to it in such detail and with such elegance. I hope I can match him.

1. Certainly the adoption of XML has facilitated interchange in comparison with the situation before. As I pointed out in the article, however, there are some forces at work behind the scenes that either prevent that from happening or make it difficult. Firstly, the TEI Guidelines are not fixed. They are always expanding and changing, and are often customised by users. And it is not possible in practice to keep end-result-related information out of what is supposed to be a purely generalised encoding scheme. Maybe the information gets expressed as generalised markup, but it is often information about concordancing, collation, external formatting, screen layout, links to particular files, etc. Also, embedding subjective markup into the text means that sharing is compromised. What if I don't want your markup codes, and want to add my own? I accept that for some people this isn't a requirement, but for others it is quite a real problem. Taking out the markup is not as simple a matter as it might seem at first glance. But if I could combine markup sets with the text freely (not as standoff markup but as standoff annotation), then I could share the actual text with someone else without giving them my embedded and unremovable bias.
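The standoff-annotation idea can be illustrated with a minimal sketch (Python, purely illustrative; the text, names and offsets below are invented, and this is not Schmidt's actual implementation). The transcription stays a plain string, and each scholar's interpretation lives in a separate set of character-offset annotations that can be applied, swapped, or ignored without ever touching the text:

```python
# Illustrative sketch of standoff annotation: the transcribed text is
# never modified; markup is kept externally as (start, end, label)
# character offsets, one set per interpreter.

TEXT = "Call me Ishmael. Some years ago I went to sea."

# Two independent, possibly conflicting annotation sets over the SAME text.
ANNOTATIONS = {
    "scholar_a": [(8, 15, "persName")],          # "Ishmael"
    "scholar_b": [(0, 16, "opening-formula")],   # "Call me Ishmael."
}

def apply_markup(text, annotations):
    """Render one annotation set as inline tags, for display only."""
    out, last = [], 0
    for start, end, label in sorted(annotations):
        out.append(text[last:start])
        out.append(f"<{label}>{text[start:end]}</{label}>")
        last = end
    out.append(text[last:])
    return "".join(out)

print(apply_markup(TEXT, ANNOTATIONS["scholar_a"]))
# Call me <persName>Ishmael</persName>. Some years ago I went to sea.
print(apply_markup(TEXT, ANNOTATIONS["scholar_b"]))
# <opening-formula>Call me Ishmael.</opening-formula> Some years ago I went to sea.
```

Because the text itself carries no codes, scholar A's and scholar B's readings can coexist, and a third reader can take the bare text with neither.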

2. My point about modern digital texts in the humanities really being digital incunables was that it is the embedded markup in our texts that makes them digital incunables. Christian seems to imply that we can still move from digital incunable to true digital texts, and keep embedded markup. I argued at some length that this isn't possible. Firstly because modern generalised markup is based on printed format structures in the same way that the first incunable books were based on manuscript techniques. The nested structures of XML, and of any other computer-recognisable embedded markup language you might care to define, will have the same structure. There is no escaping this, and it has been known since the 1950s. I don't see any way forward other than eventually abandoning the embedding altogether for these types of texts. And I'm not the first to say this. Angelo Di Iorio was saying the same thing at Balisage just last year: 'getting rid of embeddability altogether', and Dino Buzzetti urged it in his fine 2002 article. So why can't we move on to using markup more constructively?

3. It must seem quite tempting to try to represent the MVD data structure in XML, and quite a few people have suggested it. There are a few reasons why you might want to do this:
a) readability: If the individual versions of an MVD are themselves encoded as XML *within* the XML of the MVD, then each version will necessarily contain escaped XML markup. You can't have '<' and '>' or '&' as the content of an element without escaping them. Another problem is that the fragments of content in an MVD are often a single character. Both points would make the XML quite unreadable.
b) editability: Even if you could read and understand an MVD in XML form, the MVD list structure is not amenable to human editing. It is very delicate and small alterations can easily render it invalid. It can only be updated by a provably correct program, such as the nmerge program, which is provided under a free licence.
c) use of standard XML tools: You can only do limited processing of the MVD-XML using standard XML tools. You could inefficiently list a version, but this operation is already performed efficiently by the nmerge tool. I don't see how you can search all versions simultaneously or merge or update versions using standard XML tools, but you can with nmerge. And I don't see why you would want to generate variants or compare versions outside of an MVD when all the versions are already compared with all other versions within the MVD format. All you have to do is read it.
d) archivability: If you don't like your texts in MVD format (although it's compact and sharable) just archive the versions out into separate files, and they will have exactly the format that they had when you put them in.
For these reasons I can't see any practical advantage in representing MVDs in XML, but I can see lots of disadvantages.
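For readers unfamiliar with the structure under discussion, here is a toy sketch in Python of the general idea behind an MVD (invented data; this is NOT the real MVD format or the nmerge program): the document is a sequence of fragments, each tagged with the set of versions that share it, so any version can be read off, and all versions searched, from the one structure. The final line also illustrates point (a): XML content stored inside an XML wrapper must have its markup characters escaped.

```python
# Toy sketch of a multi-version document (illustrative only): a list of
# (versions, fragment) pairs, where `versions` is the set of witnesses
# that share the fragment.
from xml.sax.saxutils import escape

MVD = [
    ({"A", "B"}, "The quick "),
    ({"A"},      "brown "),
    ({"B"},      "red "),
    ({"A", "B"}, "fox."),
]

def read_version(mvd, version):
    """Reconstruct one witness by concatenating the fragments it shares."""
    return "".join(frag for versions, frag in mvd if version in versions)

def versions_containing(mvd, needle):
    """Search all witnesses at once: which versions contain `needle`?"""
    all_versions = set().union(*(versions for versions, _ in mvd))
    return {v for v in all_versions if needle in read_version(mvd, v)}

print(read_version(MVD, "A"))           # The quick brown fox.
print(versions_containing(MVD, "red"))  # {'B'}

# Point (a): an XML-encoded fragment kept as XML element content must be
# escaped, which hurts readability.
print(escape("<hi rend='italic'>brown</hi>"))
```

Note also that real variant fragments are often a single character, so an XML serialisation of such a structure would be far noisier than this toy example suggests.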

4. Christian argues that it's possible to hide the markup tags from the user and provide a friendly user interface with the power of markup hidden under the hood, and that XML editors like oXygen allow you to do this. Unfortunately embedding the markup into the text necessarily exposes it to the humanist editor at some point. Corpus linguists may be able to ignore the markup to some extent, but for people who manually encode markup tags into the texts, what kind of human interface can you have? For the editor the only way to 'hide' markup would be to display a huge palette of possible codes to embed at some point in the text, combined with copious documentation as to their significance, which attributes are permissible, and what they all mean. Surely with 512 elements that would be too confusing. It's true that we can transform some tags into formatting, but this form of tag-hiding only applies to reading a text; what about the other tags that can't be converted into formats? I don't see how there can be any way around this user interface problem with embedded markup now or in the future. If there is, I would certainly like to know.

5. I'd like to sneak in another point that he doesn't make, and I didn't make in the paper. The real reason I wrote this article about the inadequacy of embedded markup is because of the inadequacy I feel, as a software engineer and humanist, at being able to satisfy the legitimate needs of users. The Human-Computer Interaction (HCI) people tell us that we have to identify the users, who they are and what tasks they wish to perform, and then design the software around that. The problem I have is that in trying to do this I encounter the inadequate representation of embedded markup, which gets in the way and especially frustrates my ability to represent versions. There is a big difference between writing texts with markup codes and embedding markup codes into texts that never had them before. The problems that arise from taking that course of action won't go away with advances in technology, however long we wait.

Desmond Schmidt
Queensland University of Technology, Brisbane

_______________________________________________
List posts to: humanist at lists.digitalhumanities.org
List info and archives at: http://digitalhumanities.org/humanist
Listmember interface at: http://digitalhumanities.org/humanist/Restricted/listmember_interface.php
Subscribe at: http://www.digitalhumanities.org/humanist/membership_form.php





More information about the Humanist mailing list