[Humanist] 28.394 PostGreSQL and Solr for digital archives

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Wed Oct 15 07:36:49 CEST 2014


                 Humanist Discussion Group, Vol. 28, No. 394.
            Department of Digital Humanities, King's College London
                       www.digitalhumanities.org/humanist
                Submit to: humanist at lists.digitalhumanities.org

  [1]   From:    Desmond Schmidt <desmond.allan.schmidt at gmail.com>         (21)
        Subject: Re:  28.393 PostGreSQL and Solr for digital archives

  [2]   From:    "Reed, Ashley" <reeda at email.unc.edu>                     (220)
        Subject: Re:  28.393 PostGreSQL and Solr for digital archives


--[1]------------------------------------------------------------------------
        Date: Mon, 13 Oct 2014 16:23:42 +1000
        From: Desmond Schmidt <desmond.allan.schmidt at gmail.com>
        Subject: Re:  28.393 PostGreSQL and Solr for digital archives
        In-Reply-To: <20141013052556.64A04656E at digitalhumanities.org>


Hi Ed,

That's the closest parallel yet to Ashley's original question.

> Being able to get your production IT folks involved in the maintenance
> of your database can be a big win from a sustainability/sanity
> perspective

A good point, and one often missed, is how to motivate the forces around
you to contribute to *maintenance* of the data.

> One thing that we did not do in the process of switching was to get
> rid of our XML based workflow

Forgive me if I read you wrong but I would put a different spin on your
change from "Cocoon, Fedora" to "Django ... MySql", namely that you did
indeed get rid of your XML-based workflow.

> We needed to write some programs to parse the XML data and load bits
> of it into MySQL and Solr

So by the time it gets to Solr is there any XML left? As I see it,
your new workflow no longer transforms XML (that was the Cocoon part,
right?), but only uses the XML as a textual repository to do
searches. Isn't that a reduction in XML functionality? That would fit in
with the general impression of the other examples being made here.

Desmond Schmidt
Queensland University of Technology



--[2]------------------------------------------------------------------------
        Date: Tue, 14 Oct 2014 17:30:11 +0000
        From: "Reed, Ashley" <reeda at email.unc.edu>
        Subject: Re:  28.393 PostGreSQL and Solr for digital archives
        In-Reply-To: <20141013052556.64A04656E at digitalhumanities.org>


Thank you, everyone, for your helpful input. I think my original message didn’t make clear that the Archive has no plans to move away from XML. Our partners know we’re committed to it; they're suggesting that we keep using XML for encoding and archiving but use PostGreSQL and Solr (which they can support) for serving and searching it. This may or may not turn out to be the best way to handle things (“best” being a consideration that has to include “what can be supported”). We’re still investigating, and your thoughts have been most welcome.

Ed Summers’s message about the Chronam application (way down at the bottom of this digest) seems pretty close to what we might be looking at, though obviously we don’t have eight million pages to deal with. Blake was prolific, but not that prolific.

All best,
Ashley Reed

On Oct 13, 2014, at 1:25 AM, Humanist Discussion Group <willard.mccarty at mccarty.org.uk> wrote:

>                 Humanist Discussion Group, Vol. 28, No. 393.
>            Department of Digital Humanities, King's College London
>                       www.digitalhumanities.org/humanist
>                Submit to: humanist at lists.digitalhumanities.org
> 
>  [1]   From:    Joris van Zundert <joris.van.zundert at huygens.knaw.nl>    (160)
>        Subject: Re:  28.391 PostGreSQL and Solr for digital archives
> 
>  [2]   From:    Ed Summers <ehs at pobox.com>                                 (8)
>        Subject: Re:  28.391 PostGreSQL and Solr for digital archives
> 
> 
> --[1]------------------------------------------------------------------------
>        Date: Sun, 12 Oct 2014 10:16:35 +0200
>        From: Joris van Zundert <joris.van.zundert at huygens.knaw.nl>
>        Subject: Re:  28.391 PostGreSQL and Solr for digital archives
>        In-Reply-To: <20141012065352.D4A425F70 at digitalhumanities.org>
> 
> 
> Hi,
> 
> Has a very thorough needs and capabilities analysis been drawn up?
> Switching technologies is something best not considered at a whim.
> 
> This is one of those odd things I see happening a lot in technological
> contexts. It is like switching from a hole punch to scissors because your
> office supplier doesn't know or is not able to deliver hole punches.
> Although I sympathize with already overburdened IT support departments, the
> motivation is solely on the side of the IT capabilities. But one should
> consider foremost if the project is served. Secondly, in the long run this
> may be a more costly choice because you'll need to rebuild and add all the
> XML support that's inbuilt into eXist. Be careful not to use a hammer to
> drive in a screw.
> 
> Best
> --Joris
> 
> On Sunday, October 12, 2014, Humanist Discussion Group <
> willard.mccarty at mccarty.org.uk> wrote:
> 
>>                 Humanist Discussion Group, Vol. 28, No. 391.
>>            Department of Digital Humanities, King's College London
>>                       www.digitalhumanities.org/humanist
>>                Submit to: humanist at lists.digitalhumanities.org
>> 
>> 
>> 
>>        Date: Sun, 12 Oct 2014 00:04:01 +0000
>>        From: "Reed, Ashley" <reeda at email.unc.edu>
>>        Subject: Re:  28.386 PostGreSQL and Solr for digital archives
>>        In-Reply-To: <20141011064115.3D39C661C at digitalhumanities.org>
>> 
>> 
>> Martin has hit the nail on the head: “what your shop is familiar
>> with.” It’s not feasible for our hosts to support eXist for a single
>> project (and no one else on campus uses it), so as we reimplement the site
>> we’re trying to find a solution that will allow us to maintain our high
>> standards without placing undue burden on the people and units that keep us
>> up and running.
>> 
>> Thanks for these (and any other forthcoming) replies.
>> 
>> Best,
>> Ashley
>> 
>> On Oct 11, 2014, at 2:41 AM, Humanist Discussion Group <
>> willard.mccarty at mccarty.org.uk> wrote:
>> 
>>> 
>>> 
>>> 
>>> --[1]------------------------------------------------------------------------
>>>       Date: Fri, 10 Oct 2014 15:53:32 +1000
>>>       From: Desmond Schmidt <desmond.allan.schmidt at gmail.com>
>>>       Subject: Re:  28.383 PostGreSQL and Solr for digital archives?
>>>       In-Reply-To: <20141010045842.AFE8265D1 at digitalhumanities.org>
>>> 
>>> 
>>> Hi Ashley,
>>> 
>>> Postgresql is a quality relational database that has been around a long
>>> time. It's reliable and fast, but it is not a drop-in replacement for
>>> eXist. Unlike eXist its support for XML is rudimentary. So I'm kind of
>>> curious as to why this apparent shift away from XML.
>>> 
>>> Desmond Schmidt
>>> Queensland University of Technology
>>> 
>>> On Fri, Oct 10, 2014 at 2:58 PM, Humanist Discussion Group <
>>> willard.mccarty at mccarty.org.uk> wrote:
>>> 
>>>>                Humanist Discussion Group, Vol. 28, No. 383.
>>>>           Department of Digital Humanities, King's College London
>>>>                      www.digitalhumanities.org/humanist
>>>>               Submit to: humanist at lists.digitalhumanities.org
>>>> 
>>>> 
>>>> 
>>>>       Date: Thu, 9 Oct 2014 23:35:05 +0000
>>>>       From: "Reed, Ashley" <reeda at email.unc.edu>
>>>>       Subject: PostGreSQL and Solr for digital archives
>>>> 
>>>> The William Blake Archive is in the process of migrating our site off of
>>>> the eXist platform as part of a larger reimplementation and redesign
>>>> project. Our partners have recommended that our next iteration employ
>>>> PostGreSQL (for the web application) and Solr (for searching). We are
>>>> curious to know whether other digital humanities projects (XML-based
>>>> digital archives in particular) use this combination of platforms. Solr
>>>> seems to be widely used for faceted searching, but we know less about
>>>> PostGreSQL and would like to know if other projects have employed it and,
>>>> if so, what your experiences have been.
>>>> 
>>>> Thanks in advance for information and advice.
>>>> 
>>>> Ashley Reed
>>>> Andrew W. Mellon Postdoctoral Fellow in Digital Humanities, Carolina
>>>> Digital Humanities Initiative
>>>> Consultant, William Blake Archive
>>> 
>>> 
>>> 
>>> 
>>> 
>>> --[2]------------------------------------------------------------------------
>>>       Date: Fri, 10 Oct 2014 11:48:46 +0000
>>>       From: Martin Mueller <martinmueller at northwestern.edu>
>>>       Subject: Re:  28.383 PostGreSQL and Solr for digital archives?
>>>       In-Reply-To: <20141010045842.AFE8265D1 at digitalhumanities.org>
>>> 
>>> 
>>> I would be interested to learn why you want to move away from an XML
>>> database when the data are XML data to begin with, especially at a time
>>> when there is so much talk about the advantages of 'NoSQL'. But if you do,
>>> the choice between Postgresql and MySQL is probably six of one and half a
>>> dozen of the other and has less to do with intrinsic advantages than with
>>> what your shop is familiar with.
>>> 
>>> Martin Mueller
>>> Professor emeritus of English and Classics
>>> Northwestern University
>>> 
>>> 
>>> 
>>> --[3]------------------------------------------------------------------------
>>>       Date: Fri, 10 Oct 2014 09:39:55 -0500
>>>       From: Patricia Galloway <galloway at ischool.utexas.edu>
>>>       Subject: Re: PostGreSQL and Solr for digital archives
>>>       In-Reply-To: <mailman.3.1412935202.7328.humanist at lists.digitalhumanities.org>
>>> 
>>> 
>>> On 10/10/2014 5:00 AM, humanist-request at lists.digitalhumanities.org wrote:
>>>> PostGreSQL and Solr for digital archives
>>> 
>>> This is the default combination for the current version of DSpace and
>>> PostGreSQL has been the default DSpace backend since 2003, so there is a
>>> lot of experience with it in that context to draw on.
>>> 
>>> Pat Galloway
>>> School of Information
>>> University of Texas at Austin
> 
> 
> -- 
> Drs. Joris J. van Zundert
> 
> *Researcher & Developer Digital and Computational Humanities*
> Huygens Institute for the History of the Netherlands
> 
> *Royal Netherlands Academy of Arts and Sciences*
> http://www.huygens.knaw.nl/vanzundert/
> http://www.huygens.knaw.nl/vanzundert/?lang=en
> 
> -------
> 
> *Jack Sparrow: I thought you were supposed to keep to the code. Mr. Gibbs:
> We figured they were more actual guidelines.*
> 
> 
> 
> --[2]------------------------------------------------------------------------
>        Date: Sun, 12 Oct 2014 07:42:48 -0400
>        From: Ed Summers <ehs at pobox.com>
>        Subject: Re:  28.391 PostGreSQL and Solr for digital archives
>        In-Reply-To: <20141012065352.D4A425F70 at digitalhumanities.org>
> 
> 
> Hi Ashley, 
> 
> I worked at the Library of Congress on their Chronicling America website, which now provides access to 8 million historic newspaper pages. We had to rewrite the application 6 years ago so that it could have the impact we wanted it to have (indexed by Google, thousands of visitors/day, etc).
> 
> It was initially built using Cocoon and Fedora Repository. We rewrote it using the Django web framework, which prescribes a relational database (we went with MySQL because that’s what our IT folks support), and we added full text search using Solr.
> 
> We were lured by the siren song of making it a general purpose open source application, which never quite panned out IMHO. But the application did scale as we had hoped, and our usage went up by several orders of magnitude. So our stakeholders were happy, and so were we. Being able to get your production IT folks involved in the maintenance of your database can be a big win from a sustainability/sanity perspective.
> 
> One thing that we did not do in the process of switching was to get rid of our XML based workflow. XML is still used for data interchange between the NDNP partners, and also for data interchange with the future (digital preservation). We needed to write some programs to parse the XML data and load bits of it into MySQL and Solr. In the process of doing this I think we collectively learned more about the shape of our data, and were also able to easily generate some new admin reports that proved useful.
> 
> So, just because you are considering giving up your XML database does not necessarily mean you are giving up on your investment in XML data.  It just means you are using the XML in a different way. Wasn’t that kind of interoperability always the dream/goal of SGML/XML in the first place :-)
> 
> //Ed
> 
> PS. On the subject of NoSQL, one thing that you might want to consider is leapfrogging over traditional client/server web frameworks (Django, Rails, etc.) and creating a REST web service on top of Solr, which is then used by a JavaScript web framework (Bootstrap, Angular, Ember, etc.). This would allow you to simply use Solr, and not use an RDBMS like MySQL or PostgreSQL. The advantage here is that you won’t have to keep PostgreSQL and Solr synchronized. Also, your API could be used by mobile apps and third parties. The disadvantage is that you will understand and constrain the logical model of your data less. It might be worth asking if your IT shop supports ElasticSearch in addition to Solr, since it offers a better API and was built to scale a bit better than Solr was.
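
As a rough illustration of the workflow Ed describes above (keep the XML as the source of record, and write small programs that parse it and load bits of it into a relational database and a search index), here is a minimal sketch. Everything specific in it is an assumption made for illustration: the element names (archive, work, title, text), the table layout, and the use of Python's built-in sqlite3 as a stand-in for MySQL or PostgreSQL. It does not talk to Solr; it only builds the kind of flat documents one might later send to a search index.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample data: the element and attribute names are invented
# for illustration and do not reflect any real archive's schema.
SAMPLE_XML = """
<archive>
  <work id="w1"><title>First Work</title><text>Some transcribed text.</text></work>
  <work id="w2"><title>Second Work</title><text>More transcribed text.</text></work>
</archive>
""".strip()


def parse_works(xml_source):
    """Extract (id, title, text) tuples; the XML itself is left untouched."""
    root = ET.fromstring(xml_source)
    return [
        (
            work.get("id"),
            work.findtext("title", default=""),
            work.findtext("text", default=""),
        )
        for work in root.iter("work")
    ]


def load_into_db(conn, works):
    """Load the extracted fields into a relational table.

    sqlite3 is used here only as a stand-in for MySQL or PostgreSQL.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS work "
        "(id TEXT PRIMARY KEY, title TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO work (id, title, body) VALUES (?, ?, ?)",
        works,
    )
    conn.commit()


def to_search_docs(works):
    """Build flat dicts of the sort one might later send to a search index."""
    return [{"id": wid, "title": title, "text": body} for wid, title, body in works]


works = parse_works(SAMPLE_XML)
conn = sqlite3.connect(":memory:")
load_into_db(conn, works)
docs = to_search_docs(works)
```

Because the loader only reads the XML, the encoded files remain the archival copy, and the database and index can be rebuilt from them at any time.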





