[Humanist] 28.393 PostGreSQL and Solr for digital archives

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Mon Oct 13 07:25:56 CEST 2014


                 Humanist Discussion Group, Vol. 28, No. 393.
            Department of Digital Humanities, King's College London
                       www.digitalhumanities.org/humanist
                Submit to: humanist at lists.digitalhumanities.org

  [1]   From:    Joris van Zundert <joris.van.zundert at huygens.knaw.nl>    (160)
        Subject: Re:  28.391 PostGreSQL and Solr for digital archives

  [2]   From:    Ed Summers <ehs at pobox.com>                                 (8)
        Subject: Re:  28.391 PostGreSQL and Solr for digital archives


--[1]------------------------------------------------------------------------
        Date: Sun, 12 Oct 2014 10:16:35 +0200
        From: Joris van Zundert <joris.van.zundert at huygens.knaw.nl>
        Subject: Re:  28.391 PostGreSQL and Solr for digital archives
        In-Reply-To: <20141012065352.D4A425F70 at digitalhumanities.org>


Hi,

Has a very thorough needs and capabilities analysis been drawn up?
Switching technologies is best not considered on a whim.

This is one of those odd things I see happening a lot in technological
contexts. It is like switching from a hole punch to a pair of scissors
because your office supplier doesn't know hole punches or is not able to
deliver them. Although I sympathize with already overburdened IT support
departments, the motivation here lies solely on the side of IT capabilities.
But one should consider first and foremost whether the project is served.
Secondly, in the long run this may be the more costly choice, because you'll
need to rebuild and add all the XML support that's built into eXist. Be
careful not to use a hammer to drive in a screw.

Best
--Joris

On Sunday, October 12, 2014, Humanist Discussion Group <
willard.mccarty at mccarty.org.uk> wrote:

>                  Humanist Discussion Group, Vol. 28, No. 391.
>             Department of Digital Humanities, King's College London
>                        www.digitalhumanities.org/humanist
>                 Submit to: humanist at lists.digitalhumanities.org
>
>
>
>         Date: Sun, 12 Oct 2014 00:04:01 +0000
>         From: "Reed, Ashley" <reeda at email.unc.edu>
>         Subject: Re:  28.386 PostGreSQL and Solr for digital archives
>         In-Reply-To: <20141011064115.3D39C661C at digitalhumanities.org>
>
>
> Martin has hit the nail on the head: "what your shop is familiar
> with." It's not feasible for our hosts to support eXist for a single
> project (and no one else on campus uses it), so as we reimplement the site
> we’re trying to find a solution that will allow us to maintain our high
> standards without placing undue burden on the people and units that keep us
> up and running.
>
> Thanks for these (and any other forthcoming) replies.
>
> Best,
> Ashley
>
> On Oct 11, 2014, at 2:41 AM, Humanist Discussion Group <
> willard.mccarty at mccarty.org.uk> wrote:
>
> >
> >
> >
> --[1]------------------------------------------------------------------------
> >        Date: Fri, 10 Oct 2014 15:53:32 +1000
> >        From: Desmond Schmidt <desmond.allan.schmidt at gmail.com>
> >        Subject: Re:  28.383 PostGreSQL and Solr for digital archives?
> >        In-Reply-To: <20141010045842.AFE8265D1 at digitalhumanities.org>
> >
> >
> > Hi Ashley,
> >
> > Postgresql is a quality relational database that has been around a long
> > time. It's reliable and fast, but it is not a drop-in replacement for
> > eXist. Unlike eXist its support for XML is rudimentary. So I'm kind of
> > curious as to why this apparent shift away from XML.
> >
> > Desmond Schmidt
> > Queensland University of Technology
> >
> > On Fri, Oct 10, 2014 at 2:58 PM, Humanist Discussion Group <
> > willard.mccarty at mccarty.org.uk> wrote:
> >
> >>                 Humanist Discussion Group, Vol. 28, No. 383.
> >>            Department of Digital Humanities, King's College London
> >>                       www.digitalhumanities.org/humanist
> >>                Submit to: humanist at lists.digitalhumanities.org
> >>
> >>
> >>
> >>        Date: Thu, 9 Oct 2014 23:35:05 +0000
> >>        From: "Reed, Ashley" <reeda at email.unc.edu>
> >>        Subject: PostGreSQL and Solr for digital archives
> >>
> >> The William Blake Archive is in the process of migrating our site off of
> >> the eXist platform as part of a larger reimplementation and redesign
> >> project. Our partners have recommended that our next iteration employ
> >> PostGreSQL (for the web application) and Solr (for searching). We are
> >> curious to know whether other digital humanities projects (XML-based
> >> digital archives in particular) use this combination of platforms. Solr
> >> seems to be widely used for faceted searching, but we know less about
> >> PostGreSQL and would like to know if other projects have employed it and,
> >> if so, what your experiences have been.
> >>
> >> Thanks in advance for information and advice.
> >>
> >> Ashley Reed
> >> Andrew W. Mellon Postdoctoral Fellow in Digital Humanities, Carolina
> >> Digital Humanities Initiative
> >> Consultant, William Blake Archive
> >
> >
> >
> >
> >
> --[2]------------------------------------------------------------------------
> >        Date: Fri, 10 Oct 2014 11:48:46 +0000
> >        From: Martin Mueller <martinmueller at northwestern.edu>
> >        Subject: Re:  28.383 PostGreSQL and Solr for digital archives?
> >        In-Reply-To: <20141010045842.AFE8265D1 at digitalhumanities.org>
> >
> >
> > I would be interested to learn why you want to move away from an XML
> > database when the data are XML data to begin with, especially at a time
> > when there is so much talk about the advantages of 'NoSQL'. But if you do,
> > the choice between Postgresql and MySQL is probably six of one and half a
> > dozen of the other and has less to do with intrinsic advantages than with
> > what your shop is familiar with.
> >
> > Martin Mueller
> > Professor emeritus of English and Classics
> > Northwestern University
> >
> >
> >
> --[3]------------------------------------------------------------------------
> >        Date: Fri, 10 Oct 2014 09:39:55 -0500
> >        From: Patricia Galloway <galloway at ischool.utexas.edu>
> >        Subject: Re: PostGreSQL and Solr for digital archives
> >        In-Reply-To: <mailman.3.1412935202.7328.humanist at lists.digitalhumanities.org>
> >
> >
> > On 10/10/2014 5:00 AM, humanist-request at lists.digitalhumanities.org wrote:
> >> PostGreSQL and Solr for digital archives
> >
> > This is the default combination for the current version of DSpace and
> > PostGreSQL has been the default DSpace backend since 2003, so there is a
> > lot of experience with it in that context to draw on.
> >
> > Pat Galloway
> > School of Information
> > University of Texas at Austin


-- 
Drs. Joris J. van Zundert

*Researcher & Developer Digital and Computational Humanities*
Huygens Institute for the History of the Netherlands

*Royal Netherlands Academy of Arts and Sciences*
 http://www.huygens.knaw.nl/vanzundert/
 http://www.huygens.knaw.nl/vanzundert/?lang=en

-------

*Jack Sparrow: I thought you were supposed to keep to the code. Mr. Gibbs:
We figured they were more actual guidelines.*



--[2]------------------------------------------------------------------------
        Date: Sun, 12 Oct 2014 07:42:48 -0400
        From: Ed Summers <ehs at pobox.com>
        Subject: Re:  28.391 PostGreSQL and Solr for digital archives
        In-Reply-To: <20141012065352.D4A425F70 at digitalhumanities.org>


Hi Ashley, 

I worked at the Library of Congress on their Chronicling America website, which now provides access to 8 million historic newspaper pages. We had to rewrite the application 6 years ago so that it could have the impact we wanted it to have (indexed by Google, thousands of visitors/day, etc.).

It was initially built using Cocoon and Fedora Repository. We rewrote it using the Django web framework, which prescribes a relational database (we went with MySQL because that's what our IT folks support), and we added full-text search using Solr.

We were lured by the siren song of making it a general purpose open source application, which never quite panned out IMHO. But the application did scale as we had hoped, and our usage went up by several orders of magnitude. So our stakeholders were happy, and so were we. Being able to get your production IT folks involved in the maintenance of your database can be a big win from a sustainability/sanity perspective.

One thing that we did not do in the process of switching was to get rid of our XML-based workflow. XML is still used for data interchange between the NDNP partners, and also for data interchange with the future (digital preservation). We needed to write some programs to parse the XML data and load bits of it into MySQL and Solr. In the process of doing this I think we collectively learned more about the shape of our data, and were also able to easily generate some new admin reports that proved useful.
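
A minimal sketch of what such a loader program might look like, offered only as an illustration and not as the code we actually ran. The XML layout, the table and field names, and the sample record are all hypothetical; SQLite stands in for MySQL or PostgreSQL so the example runs with the Python standard library alone; and the Solr documents are only built as dictionaries, with the step of posting them to Solr left out.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical record layout for illustration; real archive XML will differ.
SAMPLE_XML = """
<issue id="sn0001-1905-03-04">
  <title>The Daily Example</title>
  <date>1905-03-04</date>
  <page seq="1"><text>Local news and weather.</text></page>
  <page seq="2"><text>Shipping notices.</text></page>
</issue>
"""


def parse_issue(xml_text):
    """Extract only the fields the web application needs to query on."""
    root = ET.fromstring(xml_text.strip())
    issue = {
        "id": root.get("id"),
        "title": root.findtext("title"),
        "date": root.findtext("date"),
    }
    pages = [
        {"seq": int(page.get("seq")), "text": page.findtext("text")}
        for page in root.findall("page")
    ]
    return issue, pages


def load_relational(conn, issue, pages):
    """Store structural metadata in relational tables (SQLite as a stand-in)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS issue (id TEXT PRIMARY KEY, title TEXT, date TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page ("
        "issue_id TEXT, seq INTEGER, PRIMARY KEY (issue_id, seq))"
    )
    conn.execute(
        "INSERT OR REPLACE INTO issue VALUES (?, ?, ?)",
        (issue["id"], issue["title"], issue["date"]),
    )
    conn.executemany(
        "INSERT OR REPLACE INTO page VALUES (?, ?)",
        [(issue["id"], page["seq"]) for page in pages],
    )
    conn.commit()


def solr_docs(issue, pages):
    """Build one flat search document per page, ready to serialize as JSON."""
    return [
        {
            "id": f"{issue['id']}-p{page['seq']}",
            "issue_id": issue["id"],
            "title": issue["title"],
            "date": issue["date"],
            "text": page["text"],
        }
        for page in pages
    ]


if __name__ == "__main__":
    issue, pages = parse_issue(SAMPLE_XML)
    conn = sqlite3.connect(":memory:")
    load_relational(conn, issue, pages)
    print(conn.execute("SELECT COUNT(*) FROM page").fetchone()[0], "pages loaded")
    print(solr_docs(issue, pages)[0]["id"])
```

The XML files stay the source of record in a sketch like this. The database and the search index are derived from them and can be rebuilt by rerunning the loader.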

So, just because you are considering giving up your XML database does not necessarily mean you are giving up on your investment in XML data. It just means you are using the XML in a different way. Wasn't that kind of interoperability always the dream/goal of SGML/XML in the first place? :-)

//Ed

PS. On the subject of NoSQL, one thing that you might want to consider is leapfrogging over traditional client/server web frameworks (Django, Rails, etc.) and creating a REST web service on top of Solr, which is then used by a JavaScript web framework (Bootstrap, Angular, Ember, etc.). This would allow you to simply use Solr, and not use an RDBMS like MySQL or PostgreSQL. The advantage here is that you won't have to keep PostgreSQL and Solr synchronized. Also, your API could be used by mobile apps and third parties. The disadvantage is that you will understand and constrain the logical model of your data less. It might be worth asking if your IT shop supports ElasticSearch in addition to Solr, since it offers a better API, and was built to scale a bit better than Solr was.
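
As a rough sketch of the thin service layer described in the PS, the following shows how a search request could be turned into a Solr select URL and how Solr's JSON response could be reshaped for a JavaScript front end. The host, the core name `archive`, and the facet field `work_title` are assumptions made up for the example. The query parameters (`q`, `wt`, `rows`, `start`, `facet`, `facet.field`) are standard Solr ones, and the response handling assumes Solr's default JSON layout, where each facet field is a flat list of alternating values and counts. No HTTP request is made here.

```python
from urllib.parse import urlencode

# Hypothetical Solr location and core name.
SOLR_SELECT = "http://localhost:8983/solr/archive/select"


def build_search_url(user_query, facet_fields=(), rows=10, start=0):
    """Translate a front-end search request into a Solr select URL."""
    params = [("q", user_query), ("wt", "json"), ("rows", rows), ("start", start)]
    if facet_fields:
        params.append(("facet", "true"))
        params.extend(("facet.field", field) for field in facet_fields)
    return SOLR_SELECT + "?" + urlencode(params)


def pair_facets(flat):
    """Turn Solr's flat [value, count, value, count, ...] list into a dict."""
    return dict(zip(flat[0::2], flat[1::2]))


def summarize(solr_json):
    """Reshape a parsed Solr JSON response into what the client needs."""
    facets = solr_json.get("facet_counts", {}).get("facet_fields", {})
    return {
        "total": solr_json["response"]["numFound"],
        "results": solr_json["response"]["docs"],
        "facets": {name: pair_facets(flat) for name, flat in facets.items()},
    }


if __name__ == "__main__":
    print(build_search_url("tyger", facet_fields=["work_title"], rows=5))
```

Keeping this layer small is the point of the design: the front end never talks to Solr directly, so the index can be restructured, or swapped for another search engine, without changing the public API.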




