[Humanist] 22.607 reliability of digital texts

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Wed Mar 11 07:21:50 CET 2009

Humanist Discussion Group, Vol. 22, No. 607.
Centre for Computing in the Humanities, King's College London
Submit to: humanist at lists.digitalhumanities.org

[1]   From:    Siobhan King <Siobhan.King at oag.govt.nz>                  (264)
Subject: RE: [Humanist] 22.604 reliability of digital texts

[2]   From:    DrWender at aol.com                                         (159)
Subject: Re: [Humanist] 22.604 reliability of digital texts

Date: Wed, 11 Mar 2009 09:05:08 +1300
From: Siobhan King <Siobhan.King at oag.govt.nz>
Subject: RE: [Humanist] 22.604 reliability of digital texts
In-Reply-To: <20090310072622.731C22F28E at woodward.joyent.us>

If you are interested in standards for digital repositories, you might want to look over the DRAMBORA (Digital Repository Audit Method Based on Risk Assessment) toolkit.

The toolkit aims to address the risks involved in long-term preservation in digital repositories. However, I'm not sure how far it goes into process analysis to ensure the quality of text digitisation; I'm only assuming it does, since their tutorials cover workflows (http://www.dcc.ac.uk/events/drambora-london-2007/).

Hope this is useful.

Siobhan King

Date: Tue, 10 Mar 2009 22:22:16 EDT
From: DrWender at aol.com
Subject: Re: [Humanist] 22.604 reliability of digital texts


Before I retired I worked as the computer specialist on the team producing
an edition of Goethe's Complete Works, and in that context I was often
confronted with questions of reliability; as a teacher I also discussed the
reliability of internet resources in the classroom. You may appreciate a
bibliographical reference to some case studies, now 10 years old:

Herbert Wender und Robert Peter: Probleme der Wiederverwendung elektronisch
gespeicherter Texte. Zwei Fallstudien. In: Computergestützte Text-Edition.
Hrsg. von Roland Kamzelak (Beihefte zu editio 12). Tübingen 1999, S. 47-60.
(cf. the short mention of this article: "Under the guiding question 'How
reliable, actually, are the texts available on the computer?', Herbert
Wender and Robert Peter discussed the problems named in the title of their
contribution, those of reusing electronically stored texts. A Kafka text
(Das Urteil) and a Goethe letter text (from the Briefe aus der Schweiz)
provided their cases. Guided by the interests of the teaching situation as
well as of re-editorial integration, they offered insights into programs of
their own, which exposed weak points and errors in the data stores they had
obtained, sought to eliminate them, and aimed at reliable electronic
reuse."
[http://computerphilologie.uni-muenchen.de/jg01/gabler.html])

For the diff-based procedure of quality testing, the most important thing is
the existence of at least 2 versions of the same literary text, one of which
is *independent* of the other. (In my time we used the "diff" tool in a Unix
context or, in cases of slightly different versions, WORD's 'version
control'; nowadays one would probably prefer some XML/XSLT stuff, but the
results will be the same. What happens when both resources are OCR-based,
experience will have to show...) In our case study we tested Kafka's story
"Das Urteil" in versions from the E-Lib in Virginia and from the German
Gutenberg Project: while in the German text base the name of the
protagonist was 'americanized' ("Bendeman" instead of "Bendemann") and some
typing errors occur, the American encoder (or OCR checker or corrector)
obviously had problems with German diphthongs ("im ubrigen" instead of "im

BTW: The situation today is the same as it was 10 years ago. Download the 2
versions from
http://gutenberg.spiegel.de/?id=12&xid=1353&kapitel=26&cHash=60edaf8b082 and
from http://etext.lib.virginia.edu/etcbin/toccer-old?id=KafUrte&images=images/modeng&data=/lv1/Archive/german-parsed&tag=public&part=1&division=div

Put the texts into 2 WORD files, reset both to the "Standard" paragraph
format, and compare them (under the EXTRAS menu). Behind some minor garbage
reflecting different coding conventions you will see the substantial
differences between the versions, which show that, without corrections,
neither should be used for citation in scholarly contexts.
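
For readers without WORD, the same kind of comparison can be sketched in a
few lines of Python using the standard difflib module. This is only a rough
illustration, not the programs used in the case study: the function names
and the two sample strings are invented for the example (the strings are not
quotations from either edition, apart from the "Bendeman"/"Bendemann"
reading mentioned above), and the whitespace normalisation merely stands in
for resetting both files to one paragraph format.

```python
import difflib
import re
import unicodedata


def normalize(text):
    """Unify the Unicode form and collapse whitespace, so that mere
    coding conventions do not show up as differences."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def word_differences(text_a, text_b):
    """Return (tag, words_a, words_b) for every span where the two
    versions disagree, comparing word by word."""
    words_a = normalize(text_a).split(" ")
    words_b = normalize(text_b).split(" ")
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    return [
        (tag, words_a[i1:i2], words_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Invented sample lines (assumption: illustration only, not edition text).
sample_a = "Georg Bendemann  schrieb\neinen Brief"
sample_b = "Georg Bendeman schrieb einen Breif"

for tag, left, right in word_differences(sample_a, sample_b):
    print(tag, left, right)
```

Comparing word by word rather than line by line keeps different line
breaks in the two downloads from drowning out the substantial variants.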

In the USA, if I see it rightly, a check by the Bibliographical Society
(ironically, also in Virginia) was established for printed editions, and the
reader can find an indication of the so-checked 'reliability' in the book.
Couldn't they build a list of checked electronic text resources in the same
way? Or better, why are the publicly funded scholarly editions of literary
texts, which for many years have been produced with electronic aid, not
freely accessible in digital form?

