[Humanist] 31.165 best practice for sustainable databases
Humanist Discussion Group
willard.mccarty at mccarty.org.uk
Mon Jul 10 08:32:31 CEST 2017
Humanist Discussion Group, Vol. 31, No. 165.
Department of Digital Humanities, King's College London
Submit to: humanist at lists.digitalhumanities.org
Date: Fri, 7 Jul 2017 12:07:37 +0100
From: Gabriel Egan <mail at gabrielegan.com>
Subject: Re: [Humanist] 31.121 best practice for sustainable databases?
In-Reply-To: <20170622054511.E8D401B41 at digitalhumanities.org>
I waited to see if Sinai Rusinek (posting of 22
June 2017) would get a flood of expert advice on
making a couple of databases, one built in Microsoft
Access and one built in Drupal, "open, reusable and
sustainable". The associated data consists of PDFs
with an OCR layer, audio files, and video files. There
was I think only one posted response, so I'll give my
advice based on preserving for online consumption a
few databases of a few tens-of-thousands of records
(texts and images) each, originally made in the 1990s
and early 2000s.
The first two desiderata, "open" and "reusable", are
much easier to achieve than the last one, "sustainable",
if by "sustainable" we mean that someone will be able
to add new records to the databases in the future. If
by "sustainable" we mean merely that the existing
records will be retained in a form that can be
preserved (kept available to users) indefinitely,
then "sustainable" is pretty easy to achieve too.
So, assuming that the desire is to keep the existing
records available to users and the data open and
reusable, I would recommend:
* Convert all the PDFs to PDF/A, the open archive format
ratified by the International Organization for Standardization (ISO).
* Convert all the audio files to WAV format. Although
this is not an open format, it's so widely used that
it's nearly as good as one. There isn't an open format
for audio that is also widely used now. When one
emerges, you can transcode to that.
* Convert all the video files to H.264/MPEG-4. Again,
not an open format but a de facto standard that won't
suddenly become obsolete. There isn't an open format
for video that is also widely used now. When one
emerges, you can transcode to that.
* Have the Access database and Drupal database each
export their records as static HTML with links to
all the converted content: text, sound, and video.
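The audio and video conversions in the list above could be scripted rather than done by hand. What follows is only a minimal sketch, assuming the free ffmpeg command-line tool is installed and was built with the libx264 encoder. The function names, the extension lists, and the choice of 16-bit PCM for the WAV files are illustrative assumptions, not something specified in the advice above.

```python
import subprocess
from pathlib import Path
from typing import List, Optional

# Illustrative assumptions: adjust to the file types actually in the collection.
AUDIO_EXTS = {".mp3", ".wma", ".aiff", ".ogg", ".m4a"}
VIDEO_EXTS = {".avi", ".mov", ".wmv", ".flv", ".mpg"}


def ffmpeg_command(src: Path, out_dir: Path) -> Optional[List[str]]:
    """Build the ffmpeg argument list for one file, or None if it is not audio/video."""
    ext = src.suffix.lower()
    if ext in AUDIO_EXTS:
        # Uncompressed 16-bit PCM audio in a WAV container.
        return ["ffmpeg", "-n", "-i", str(src), "-c:a", "pcm_s16le",
                str(out_dir / (src.stem + ".wav"))]
    if ext in VIDEO_EXTS:
        # H.264 video with AAC audio in an MP4 container.
        return ["ffmpeg", "-n", "-i", str(src), "-c:v", "libx264", "-c:a", "aac",
                str(out_dir / (src.stem + ".mp4"))]
    return None


def convert_all(src_dir: Path, out_dir: Path) -> int:
    """Run ffmpeg on every recognised file under src_dir; return how many were converted."""
    out_dir.mkdir(parents=True, exist_ok=True)
    converted = 0
    for src in sorted(src_dir.rglob("*")):
        cmd = ffmpeg_command(src, out_dir) if src.is_file() else None
        if cmd is not None:
            # "-n" in the command makes ffmpeg refuse to overwrite an existing output,
            # and check=True stops the run at the first failure.
            subprocess.run(cmd, check=True)
            converted += 1
    return converted
```

Calling `convert_all(Path("originals"), Path("converted"))` (both directory names hypothetical) would write the converted copies into a separate directory and leave the originals untouched, which matters because the originals are what you would transcode from again if a better format emerges.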
A static HTML database will continue to work for
many years without modification or maintenance.
Because it exists as a series of read-only HTML
files on your web-server, with no software running
on the server side other than the simplest of web-
server programs, it is virtually immune to hacking
and has no ongoing maintenance cost. If your server
does get hacked, you upload the full set of HTML
and content files to some other simple web-server
and carry on: no mess to clean up, no patching to
do. (Obviously, you need to keep a spare copy of
the entire static-HTML dataset in a safe place to do this.)
BUT a static HTML database is not easy to add
new records to. In the 'export' process, Access
and Drupal will have made decisions about how
to name each HTML file and each cross-reference
in the record-set that they won't have asked you
about, so to add new records you need to figure
out what conventions they were using. This is
do-able, but not straightforward.
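One way to avoid having to reverse-engineer naming conventions is to generate the static HTML with a small script of your own, so that the conventions are written down in code. The following is a rough sketch under stated assumptions: the records have already been exported from Access or Drupal into a list of dictionaries with hypothetical keys "id", "title", and "files", and the file-naming scheme (record-000042.html) is invented for illustration. It is not how Access or Drupal export their data.

```python
import html
from pathlib import Path


def record_filename(record_id: int) -> str:
    """The naming convention, stated once: a zero-padded ID, e.g. record-000042.html."""
    return f"record-{record_id:06d}.html"


def render_record(record: dict) -> str:
    """Render one record as a minimal standalone HTML page linking to its content files."""
    title = html.escape(record["title"])
    links = "\n".join(
        f'<li><a href="{html.escape(path, quote=True)}">{html.escape(path)}</a></li>'
        for path in record.get("files", [])
    )
    return (
        "<!DOCTYPE html>\n"
        f'<html lang="en"><head><meta charset="utf-8"><title>{title}</title></head>\n'
        f"<body><h1>{title}</h1>\n<ul>\n{links}\n</ul>\n"
        '<p><a href="index.html">Back to index</a></p></body></html>\n'
    )


def export_site(records: list, out_dir: Path) -> None:
    """Write one page per record plus an index.html listing them in ID order."""
    out_dir.mkdir(parents=True, exist_ok=True)
    items = []
    for record in sorted(records, key=lambda r: r["id"]):
        name = record_filename(record["id"])
        (out_dir / name).write_text(render_record(record), encoding="utf-8")
        items.append(f'<li><a href="{name}">{html.escape(record["title"])}</a></li>')
    index = (
        "<!DOCTYPE html>\n"
        '<html lang="en"><head><meta charset="utf-8"><title>Index</title></head>\n'
        "<body><h1>Index</h1>\n<ul>\n" + "\n".join(items) + "\n</ul></body></html>\n"
    )
    (out_dir / "index.html").write_text(index, encoding="utf-8")
```

With something like this, adding a record means appending one entry to the exported record list and running `export_site` again; every file name and cross-reference is regenerated by the same rule, so nothing has to be guessed.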
So, the key question is "do I want to be able
to add new records to these databases in the
future?" If you don't, spin off static-HTML
versions of them and think of them as essentially
fixed archives. If you do, you need either to
maintain a Content Management System (like Drupal)
to do the work, or else see if someone can figure
out how to add records to the static HTML version.
I wouldn't hold them up as "examples of best practice"
in this regard, but the projects "Modernist Magazines"
and "The Hockliffe Project" and "Caxton's Chaucer"
at http://cts.dmu.ac.uk are working examples of static
HTML spinoffs from Content Management System databases
(one of which was Drupal, I believe) and they meet the
basic need of keeping the digital materials online,
findable, and reusable.
Professor Gabriel Egan, De Montfort University. www.gabrielegan.com
Director of the Centre for Textual Studies http://cts.dmu.ac.uk
National Teaching Fellow http://www.heacademy.ac.uk/ntfs
Gen. Ed. New Oxford Shakespeare http://www.oxfordpresents.com/ms/nos
On 6/22/2017 6:45 AM, Humanist Discussion Group wrote:
> Humanist Discussion Group, Vol. 31, No. 121.
> Department of Digital Humanities, King's College London
> Submit to: humanist at lists.digitalhumanities.org
> Date: Wed, 21 Jun 2017 19:31:06 +0300
> From: Sinai Rusinek <sinai.rusinek at mail.huji.ac.il>
> Subject: sustainable databases best practice
> Dear all,
> I am writing for your advice regarding two cases of database projects in
> our Humanities faculty, which are, I believe, symptomatic:
> One was built a few years ago as an information systems student project,
> as an Access DB, and consists of many thousands of PDFs of short texts -
> only partly OCR'd, with varied fields of rich metadata. The other was built
> on a Drupal platform and consists of a growing number of sound and video
> files, transcribed and with fields of rich metadata. Both cannot be
> supported any longer by their original builders, and in both projects there
> are some funds to invest in the restructuring of the database. I want
> to use this opportunity to make sure the projects move to an open, reusable
> and sustainable model. The problem: there is no DH lab or consultancy
> around yet, and as much as we are hoping that this will change soon, we
> have to take decisions fast in these two cases.
> Could you share some tips, dos and don'ts, or refer me to any examples of
> best practice regarding databases?
> All best,
> Sinai Rusinek
> Digital Humanities @ Haifa University http://dighum.haifa.ac.il/
> Digital Humanities Israel http://www.thedigin.org/en/#