[Humanist] 28.400 PostGreSQL and Solr for digital archives

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Thu Oct 16 08:47:45 CEST 2014


                 Humanist Discussion Group, Vol. 28, No. 400.
            Department of Digital Humanities, King's College London
                       www.digitalhumanities.org/humanist
                Submit to: humanist at lists.digitalhumanities.org

  [1]   From:    Martin Mueller <martinmueller at northwestern.edu>           (57)
        Subject: Re:  28.394 PostGreSQL and Solr for digital archives

  [2]   From:    Ed Summers <ehs at pobox.com>                                (16)
        Subject: Re:  28.394 PostGreSQL and Solr for digital archives


--[1]------------------------------------------------------------------------
        Date: Wed, 15 Oct 2014 11:44:16 +0000
        From: Martin Mueller <martinmueller at northwestern.edu>
        Subject: Re:  28.394 PostGreSQL and Solr for digital archives
        In-Reply-To: <20141015053649.8FA7A6083 at digitalhumanities.org>


Desmond asks a pointed question that has also been on my mind. It is one
thing to store data in XML. It is another to mediate the query potential
of the XML in such a manner that users can get at it. I call this
"decoding the encoded." In the TEI world I'm familiar with quite a few
projects with a very "lossy" interface: little if anything of the TEI
encoding is actually available to the user. As I understand it, Solr can
get you some of the XML encoding with indexing that associates words with
some of the information kept in XPaths. But all of them? So I'd be
interested in the trade-offs involved in transforming XML into SQL in the
particular projects Ashley and Ed write about. What gets lost? And who
gets to decide whether it matters?
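To make the question concrete, here is the kind of flattening I have in mind. This is a hypothetical sketch, not any project's actual code: the element choice, the field names, and the use of lxml are my own assumptions. Each selected TEI element becomes one flat record of the sort a Solr index holds, and anything not explicitly mapped to a field never reaches the user.

# flatten_tei.py -- hypothetical sketch: flatten selected TEI elements into
# the kind of flat field/value records a Solr index can hold.
# Assumes lxml is installed; element choice and field names are illustrative.
import json
from lxml import etree

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def flatten(tei_path, xpath="//tei:p"):
    tree = etree.parse(tei_path)
    records = []
    for elem in tree.xpath(xpath, namespaces=TEI_NS):
        path = tree.getpath(elem)
        records.append({
            "id": tei_path + "#" + path,          # where the record came from
            "xpath": path,                        # structural context that is kept
            "text": " ".join(elem.itertext()),    # the words themselves
            "rend": elem.get("rend"),             # one attribute we chose to keep
            # nested markup, the other attributes, and everything else in the
            # encoding is simply not in the index
        })
    return records

if __name__ == "__main__":
    print(json.dumps(flatten("sample_tei.xml"), indent=2))

Everything that falls under that last comment is what I mean by "lossy".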

Martin Mueller
Professor emeritus of English and Classics
Northwestern University

On 10/15/14 12:36 AM, "Humanist Discussion Group"
<willard.mccarty at mccarty.org.uk> wrote:

>                 Humanist Discussion Group, Vol. 28, No. 394.
>            Department of Digital Humanities, King's College London
>                       www.digitalhumanities.org/humanist
>                Submit to: humanist at lists.digitalhumanities.org
>
>        Date: Mon, 13 Oct 2014 16:23:42 +1000
>        From: Desmond Schmidt <desmond.allan.schmidt at gmail.com>
>        Subject: Re:  28.393 PostGreSQL and Solr for digital archives
>        In-Reply-To: <20141013052556.64A04656E at digitalhumanities.org>
>
>
>Hi Ed,
>
>That's the closest parallel yet to Ashley's original question.
>
>> Being able to get your production IT folks involved in the maintenance
>> of your database can be a big win from a sustainability/sanity
>> perspective
>
>A good point, and one often missed, is how to motivate the forces around
>you to contribute to *maintenance* of the data.
>
>> One thing that we did not do in the process of switching was to get
>> rid of our XML based workflow
>
>Forgive me if I read you wrong but I would put a different spin on your
>change from "Cocoon, Fedora" to "Django ... MySql", namely that you did
>indeed get rid of your XML-based workflow.
>
>> We needed to write some programs to parse the XML data and load bits
>> of it into MySQL and Solr
>
>So by the time it gets to Solr is there any XML left? As I see it,
>your new workflow no longer transforms XML (that was the Cocoon part,
>right?), but only uses the XML as a textual repository to do
>searches. Isn't that a reduction in XML functionality? That would fit in
>with the general impression of the other examples being made here.
>
>Desmond Schmidt
>Queensland University of Technology




--[2]------------------------------------------------------------------------
        Date: Wed, 15 Oct 2014 15:30:10 -0400
        From: Ed Summers <ehs at pobox.com>
        Subject: Re:  28.394 PostGreSQL and Solr for digital archives
        In-Reply-To: <20141015053649.8FA7A6083 at digitalhumanities.org>



Hi Desmond,

> Forgive me if I read you wrong but I would put a different spin on your
> change from "Cocoon, Fedora" to "Django ... MySql", namely that you did
> indeed get rid of your XML-based workflow.

You are most welcome to spin it :-) All I meant to say is that the existing XML-based workflow for the National Digital Newspaper Program didn’t change substantially. We continued to receive hard drives with XML, TIFF, and JP2 data on them, which continued to be moved to archival storage and into our access application just as they always had been.

The access application (Chronicling America), on the other hand, did change quite a bit. Rather than storing the XML in Fedora and transforming it on the fly to HTML with XSLT, we parsed the metadata we needed from the XML and stored it in MySQL and Solr. The Django web application then queried MySQL and Solr to deliver its views.
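To picture that step, here is a rough, made-up sketch of the sort of thing the loading programs do. It is not the actual NDNP/Chronicling America code: the element names, the Solr URL, and the use of pysolr are stand-ins, and the MySQL side (one INSERT per record via the Django ORM) would look analogous.

# load_issue.py -- hypothetical sketch of parsing a batch XML file and
# loading a few fields into Solr. All names here are illustrative, not NDNP's.
# Assumes pysolr is installed; the relational side would be a similar
# extract-and-insert step against MySQL.
import xml.etree.ElementTree as ET
import pysolr

def parse_issue(xml_path):
    """Pull out just the fields the access application needs."""
    root = ET.parse(xml_path).getroot()
    return {
        "id": root.findtext("lccn", "") + "/" + root.findtext("date_issued", ""),
        "title": root.findtext("title", ""),
        "date_issued": root.findtext("date_issued", ""),
        "ocr_text": root.findtext("ocr_text", ""),   # full text for searching
    }

if __name__ == "__main__":
    doc = parse_issue("issue.xml")
    solr = pysolr.Solr("http://localhost:8983/solr/newspapers")
    solr.add([doc], commit=True)    # Django then queries this index for its views

The original XML stays on disk untouched; MySQL and Solr only ever hold the derived fields, which is why the XML can still serve as the source for the next rewrite.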

> So by the time it gets to Solr is there any XML left? As I see it,
> your new workflow no longer transforms XML (that was the Cocoon part,
> right?), but only uses the XML as a textual repository to do
> searches. Isn't that a reduction in XML functionality? That would fit in
> with the general impression of the other examples being made here.

You’re right, Solr and MySQL were not used to store XML. However, XML continued to arrive on hard drives, and continued to be used to populate MySQL and Solr. NDNP and Chronicling America are not static or closed: new data is generated and processed all the time. And that data continues to be primarily XML-based, even though we moved Chronicling America to MySQL/Solr.

I actually didn’t observe a reduction of XML functionality. In fact, I found that the new workflow highlighted XML’s strengths as a data interchange format. Some day (perhaps soon) the access application will be rewritten using a better web framework and database. When that day comes, whoever does the rewriting can reach for the XML data as a data source.

I’m not sure if that helps clarify much, but thanks for the response!

//Ed




