[Humanist] 28.409 PostGreSQL and Solr for digital archives

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Sat Oct 18 07:37:16 CEST 2014


                 Humanist Discussion Group, Vol. 28, No. 409.
            Department of Digital Humanities, King's College London
                       www.digitalhumanities.org/humanist
                Submit to: humanist at lists.digitalhumanities.org

  [1]   From:    Hugh Cayless <philomousos at gmail.com>                      (84)
        Subject: Re:  28.404 PostGreSQL and Solr for digital archives

  [2]   From:    Martin Holmes <mholmes at uvic.ca>                           (11)
        Subject: RE:  28.404 PostGreSQL and Solr for digital archives


--[1]------------------------------------------------------------------------
        Date: Fri, 17 Oct 2014 09:52:31 -0400
        From: Hugh Cayless <philomousos at gmail.com>
        Subject: Re:  28.404 PostGreSQL and Solr for digital archives
        In-Reply-To: <20141017064531.B75ED622A at digitalhumanities.org>


Dear Desmond,

I am one of those who believe that TEI should eventually move toward
defining an abstract model, expressible in a variety of serializations (XML
being one of those). I don't think that view is particularly heretical
among users of the TEI.

That being said, however, a move like the one you suggest isn't feasible
for a variety of reasons. HTML is primarily a language for visually
formatting text+other media. TEI is primarily for encoding the semantics of
text+other media. This means there are a number of mismatches between the
TEI Way and the HTML Way which make such a 1:1 conversion very
difficult. It may now, given the current state of the technology, be
possible to sensibly express TEI in HTML but that doesn't solve many
problems by itself. Work is underway to define a "Simple" expression of TEI
that has both a data model and a processing model (e.g. TEI elements will
have formatting conventions) and this will, I hope, be a stepping stone
towards the goal I mentioned in the first paragraph, but there's a lot of
work to be done yet.

Furthermore, because of its nature, HTML is a moving target in ways that
TEI isn't. Having had the experience of migrating old (ca. 10 years or
more) TEI SGML/XML collections and old HTML collections, I can tell you the
TEI is *vastly* easier to deal with. It makes for a much better archival
format. Constraints are good from this perspective, and HTML has very few
constraints.

"Interchange" and "interoperable" are superficially simple concepts, but
the reality is very different. Interchange might mean many different things
in different contexts. Adhering to common standards such as TEI and XML
makes interchange *possible*, but nothing is going to make it plug-and-play.

Lastly, I don't really see a problem with a publishing workflow that has at
its core files most users won't access. The TEI files aren't themselves the
deliverable to users; it's the viewing and discovery interfaces that they
support which most users will want. To have that and a format you can build
a sensible editorial workflow around *and* a decent archival format that
preserves a great deal of the interpretive work that went into the files'
creation seems like a huge win to me.

All the best,
Hugh

On Fri, Oct 17, 2014 at 2:45 AM, Humanist Discussion Group <
willard.mccarty at mccarty.org.uk> wrote:

>                  Humanist Discussion Group, Vol. 28, No. 404.
>             Department of Digital Humanities, King's College London
>                        www.digitalhumanities.org/humanist
>                 Submit to: humanist at lists.digitalhumanities.org
>
>
>
>         Date: Thu, 16 Oct 2014 21:08:43 +1000
>         From: Desmond Schmidt <desmond.allan.schmidt at gmail.com>
>         Subject: Re:  28.400 PostGreSQL and Solr for digital archives
>         In-Reply-To: <20141016064745.23806622A at digitalhumanities.org>
>
>
> Hi Martin,
>
> I'd like to expand the discussion a bit, but my point of departure is
> your remark that: "little if anything of the TEI encoding is actually
> available to the user". The technical reason for this is, of course,
> that these applications do not intrinsically support XML, although they
> can import it. But the underlying reason is that we encoded the XML
> through the exercise of human judgement and interpretation. It should
> then come as no surprise that some of that information gets lost when it
> is read by a machine.
>
> What I would like to suggest as a remedy to this situation is that we
> stop trying to share our data on the *basis* of human-determined tags.
> Instead we could use HTML and encode the interpretative part as class
> attributes or as RDFa or microformats. TEI could become an *abstract*
> set of names defining textual properties, without reference to any
> specific technology. One way of recording and expressing those
> properties could be via HTML. If we did that then everyone's files would
> be interoperable because they would already be in the language of the
> Web.
>
> Of course, we can convert the XML to HTML whenever we want, but we don't
> seek to share it in that form, we seek instead to share the XML, and we
> can't, because TEI-XML is not interoperable. And yet, there is nothing
> in TEI-XML that can't be expressed in some alternative way in HTML.
> Especially since, according to the recent survey by Burghart
> (http://jtei.revues.org/372), 97% of TEI-encoded texts of manuscripts
> (and probably a similar proportion of printed texts) just get converted
> into HTML anyway. So please explain to
> me why we need to use XML, because I really don't see it.
>
> Desmond Schmidt
> Queensland University of Technology


--[2]------------------------------------------------------------------------
        Date: Fri, 17 Oct 2014 15:36:42 +0000
        From: Martin Holmes <mholmes at uvic.ca>
        Subject: RE:  28.404 PostGreSQL and Solr for digital archives
        In-Reply-To: <20141017064531.B75ED622A at digitalhumanities.org>


Hi Desmond,

One obvious reason for encoding initially in TEI (regardless of what you end up producing) is that you can use the excellent Guidelines and easily create customized schemas for your project which express your theoretical approaches and constrain your practice to match them. You could of course do this in HTML too, but this:

<span class="tei_persName"><span class="tei_forename">Fred</span> <span class="tei_surname">Bloggs</span></span>

is a lot less friendly and human-readable than this:

<persName><forename>Fred</forename> <surname>Bloggs</surname></persName>
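
To illustrate how mechanical the mapping between these two forms is, here is a minimal Python sketch. It assumes un-namespaced TEI (real TEI documents normally use the TEI namespace, which this does not handle) and it drops any attributes on the TEI elements. The helper name tei_to_html and the "tei_" class prefix are illustrative choices following the convention in the example above, not part of any standard.

```python
import xml.etree.ElementTree as ET


def tei_to_html(elem, prefix="tei_"):
    """Mirror a TEI element tree as nested <span> elements.

    Hypothetical helper: each TEI element name becomes a class
    attribute value (prefix + name). Attributes on the TEI element
    are ignored in this sketch.
    """
    span = ET.Element("span", {"class": prefix + elem.tag})
    # Carry over the text inside the element and the text following it.
    span.text = elem.text
    span.tail = elem.tail
    for child in elem:
        span.append(tei_to_html(child, prefix))
    return span


tei = "<persName><forename>Fred</forename> <surname>Bloggs</surname></persName>"
html = ET.tostring(tei_to_html(ET.fromstring(tei)), encoding="unicode")
print(html)
```

The conversion itself is trivial; what the HTML form loses is the schema-level constraint on which elements may appear where, which is the point being made about the TEI Guidelines above.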

Cheers,
Martin

Martin Holmes
mholmes at uvic.ca
martin at mholmes.com
mholmes at halfbakedsoftware.com



