[Humanist] 27.214 2,000 18C texts: first fruits

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Tue Jul 16 13:38:58 CEST 2013


                 Humanist Discussion Group, Vol. 27, No. 214.
            Department of Digital Humanities, King's College London
                       www.digitalhumanities.org/humanist
                Submit to: humanist at lists.digitalhumanities.org



        Date: Mon, 15 Jul 2013 21:32:05 +0000
        From: Martin Mueller <martinmueller at northwestern.edu>
        Subject: Abbot MorphAdorner Collaboration

The Abbot MorphAdorner collaboration

The Center for Digital Research in the Humanities  http://cdrh.unl.edu/
at the University of Nebraska and Northwestern University's Academic and
Research Technologies
 http://www.it.northwestern.edu/about/departments/at/  are pleased to
announce the first fruits of a collaboration between the Abbot
 http://abbot.unl.edu/  and EEBO-MorphAdorner projects: the release of
some 2,000 18th century texts from the TCP-ECCO collections in a TEI-P5
format and with linguistic annotation. More texts will follow shortly,
subject to the access restrictions that will govern the use of TCP texts
for the remainder of this decade.

The Text Creation Partnership (TCP) collection currently consists of about
50,000 fully transcribed SGML texts from the first three centuries of
English print culture. The collection will grow to approximately 75,000
volumes and will contain at least one copy of every book published before
1700 as well as substantial samples of 18th century texts published in the
British Isles or North America. The ECCO-TCP texts are already in the
public domain. The other texts will follow them between 2014 and 2015. The
Evans texts will be released in June 2014, followed by a release of some
25,000 EEBO texts in 2015.

It is a major goal of the Abbot and EEBO-MorphAdorner collaboration to
turn the TCP texts into the foundation for a "Book of English," defined as

* a large, growing, collaboratively curated, and public domain corpus
* of written English since its earliest modern form
* with full bibliographical detail
* and light but consistent structural and linguistic annotation

Texts in the annotated TCP corpus will exist in more than one format so as
to facilitate the different uses to which they are likely to be put. In a
first step, Abbot transforms the SGML source text into a TEI P5 XML
format. Abbot, a software program designed by Brian Pytlik Zillig and
Stephen Ramsay, can read arbitrary XML files and convert them into other
XML formats or a shared format. Abbot generates its own set of conversion
routines at runtime by reading an XML schema file and programmatically
effecting the desired transformations. It is an excellent tool for
creating an environment in which texts originating in separate projects
can acquire a higher degree of interoperability. A prototype of Abbot was
used in the MONK project to harmonize texts from several collections,
including the TCP, Chadwyck-Healey's Nineteenth-Century Fiction, the
Wright Archive of American novels 1851-1875, and Documenting the American
South.

This first transformation maintains all the typographical data recorded in
the original SGML transcription, including long 's', printer's
abbreviations, superscripts, etc. In a second step, MorphAdorner tokenizes
this file. MorphAdorner  http://morphadorner.northwestern.edu/  was
developed by Philip R. Burns. It is a multi-purpose suite of NLP tools
with special features for the tokenization, analysis, and annotation of
historical corpora. The tokenization uses algorithms and heuristics
specific to the practices of Early Modern print culture, wraps every word
token in a <w> element with a unique ID, and explicitly marks sentence
boundaries.

In the next step (conceptually different but merged in practice with the
previous), some typographical features are removed from the tokenized
text, but all such changes are recorded in a change log and may therefore
be reversed. The changes aim at making it easier to manipulate the corpus
with software tools that presuppose modern printing practices. They
involve such things as replacing long 's' with plain 's', or resolving
unambiguous printer's abbreviations and superscripts.
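
The tokenize-then-normalize idea described above can be illustrated with a
minimal Python sketch. It is not MorphAdorner's code: the function names,
the ID scheme ("w1", "w2", ...) and the shape of the change log are
assumptions made for illustration, and the tokenizer is a naive regular
expression rather than the Early Modern heuristics MorphAdorner applies.

```python
import re

LONG_S = "\u017f"  # the long 's' character

def tokenize(text, prefix="w"):
    """Split text into word tokens, each with a unique ID (hypothetical scheme)."""
    words = re.findall(r"\w+", text)
    return [{"id": f"{prefix}{i}", "form": w} for i, w in enumerate(words, start=1)]

def normalize(tokens):
    """Replace long 's' with plain 's' in place; return a log of every change."""
    change_log = []
    for tok in tokens:
        if LONG_S in tok["form"]:
            change_log.append((tok["id"], tok["form"]))  # remember the original
            tok["form"] = tok["form"].replace(LONG_S, "s")
    return change_log

def revert(tokens, change_log):
    """Undo the normalization by restoring the logged original forms."""
    originals = dict(change_log)
    for tok in tokens:
        if tok["id"] in originals:
            tok["form"] = originals[tok["id"]]

sample = f"The {LONG_S}oft and gentle pa{LONG_S}{LONG_S}ions"
tokens = tokenize(sample)
log = normalize(tokens)
print([t["form"] for t in tokens])  # modernized forms
revert(tokens, log)
print([t["form"] for t in tokens])  # original forms restored
```

Keying the log on the token ID rather than on the position in the text is
what lets a change be reversed for one token without touching the others.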

The tokenized version of the text will be very useful to scholars who have
an interest in original spelling editions and want to use the TCP
transcriptions as a point of departure for projects that 'upcode' selected
texts by comparing them with the page images and encoding typographical
detail with greater precision.

Unlike many other NLP programs, MorphAdorner treats tokenization and
annotation as separate procedures. This means that in a second pass over a
text (or part of it), you can compare results or protect manual
corrections introduced in a first run. This makes it much easier to manage
the progressive improvement of a corpus over time.

MorphAdorner's annotations associate each token with a part-of-speech tag,
a lemma or dictionary entry form of a word in its modern form, and a
standardized spelling. For the TCP project MorphAdorner uses the NUPOS tag
set developed by Martin Mueller. NUPOS accommodates the morphological
variance of English from Chaucer to the present day within a single tag
set that requires minimal compromise. A "MorphAdorned" file can be
produced as a TEI-P5 file or as a "verticalized" file in which every token
is a column in a table row whose other columns describe its lexical,
structural, and grammatical properties.
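
To make the relation between the two outputs concrete, here is a small
Python sketch that turns <w> elements into tab-separated rows. The
attribute names (xml:id, lem, pos, reg), the ID values, and the tags in the
sample are assumptions chosen for illustration; the actual MorphAdorner
output may name or arrange them differently.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical fragment; attribute names and values are illustrative only.
SAMPLE = """<s xmlns="http://www.tei-c.org/ns/1.0">
  <w xml:id="A1-0010" lem="love" pos="vvz" reg="loveth">louyth</w>
  <w xml:id="A1-0020" lem="jealousy" pos="n1" reg="jealousy">gelosy</w>
</s>"""

def verticalize(xml_text):
    """Return one (id, spelling, standard spelling, lemma, pos) row per <w>."""
    root = ET.fromstring(xml_text)
    rows = []
    for w in root.iter(f"{TEI}w"):
        rows.append((w.get(XML_ID), w.text, w.get("reg"), w.get("lem"), w.get("pos")))
    return rows

for row in verticalize(SAMPLE):
    print("\t".join(row))
```

A table of this kind loads directly into a spreadsheet or a database,
which is one reason the verticalized form is convenient for curation.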

Both of these outputs are machine-actionable rather than human-readable.
It is important to keep in mind that the uses for annotation of this
kind extend far beyond the discipline-specific needs of linguists. Think
of an "Abbotized" and "MorphAdorned" corpus as a second-generation digital
library in which primary texts that form the documentary infrastructure
for text-centric scholarly work are submitted to three cataloguing
operations that create the potential for querying a corpus separately or
in combination by criteria from

1. the top or bibliographical level of a document as a whole
2. the middle level of the discursive structure of each text
3. the bottom level of words and sentences.
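
A query that combines the three levels might look like the following
Python sketch. The TokenRow fields, the division labels, and the sample
titles are all hypothetical; they stand in for whatever bibliographical,
structural, and word-level columns a real corpus table would carry.

```python
from dataclasses import dataclass

@dataclass
class TokenRow:
    year: int       # top level: bibliographical data about the document
    title: str
    division: str   # middle level: discursive structure, e.g. "verse"
    lemma: str      # bottom level: the word itself
    pos: str

def query(rows, years=None, division=None, lemma=None):
    """Return the rows that satisfy every criterion that was supplied."""
    hits = []
    for row in rows:
        if years is not None and not (years[0] <= row.year <= years[1]):
            continue
        if division is not None and row.division != division:
            continue
        if lemma is not None and row.lemma != lemma:
            continue
        hits.append(row)
    return hits

rows = [
    TokenRow(1711, "Hypothetical Poem", "verse", "jealousy", "n1"),
    TokenRow(1711, "Hypothetical Poem", "prose", "jealousy", "n1"),
    TokenRow(1790, "Hypothetical Novel", "prose", "love", "n1"),
]
print(query(rows, years=(1700, 1750), division="verse", lemma="jealousy"))
```

Each criterion is optional, so the same function serves queries on one
level alone or on any combination of the three.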

In a corpus that spans the orthographic and morphological variance of
several centuries, linguistic annotation serves both to articulate and
erase difference, as when 'louyth' and 'gelosy' are mapped to 'love' and
'jealousy'. You can create modern-spelling editions of any text
algorithmically or with very limited manual intervention. You can also
look for other instances of such abstract patterns as "soft, gentle, and
low" or "handsome, clever, and rich."
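
A search for such an abstract pattern can be sketched in Python as a scan
over part-of-speech tags for the sequence adjective, adjective,
conjunction, adjective. The tags used here ("j" for adjective, "cc" for
coordinating conjunction, and the others in the sample) are simplified
assumptions for illustration, not a statement of the full NUPOS tag set.

```python
def adjective_triplets(tokens, adj="j", conj="cc"):
    """Find 'ADJ, ADJ, and ADJ' patterns in a list of (form, pos) pairs."""
    # Drop punctuation tokens so that commas do not interrupt the pattern.
    words = [(form, pos) for form, pos in tokens if form.isalpha()]
    hits = []
    for i in range(len(words) - 3):
        (a, pa), (b, pb), (_, pc), (d, pd) = words[i:i + 4]
        if pa == adj and pb == adj and pc == conj and pd == adj:
            hits.append((a, b, d))
    return hits

# Hypothetical tagged input; the tag values are illustrative.
sample = [
    ("Emma", "np1"), ("Woodhouse", "np1"), (",", ","),
    ("handsome", "j"), (",", ","), ("clever", "j"), (",", ","),
    ("and", "cc"), ("rich", "j"),
]
print(adjective_triplets(sample))
```

Because the match runs on tags rather than on spellings, the same query
finds the pattern regardless of which adjectives fill the three slots.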

We hope that the text corpora created with Abbot and MorphAdorner will
spur the development of easy-to-use search engines that will make it
possible for scholars to find new ways of exploring historical corpora for
thematic, stylistic or other purposes. The latest version of the
PhiloLogic search engine already processes MorphAdorned texts.
MorphAdorner can also produce output for use by the Sketch Engine and by
BlackLab, a corpus retrieval engine based on Lucene. The prototype of a
BlackLab search interface with ECCO texts is available at
http://devadorner.northwestern.edu/corpussearch/

The tabular representation of Abbotized and MorphAdorned text creates an
excellent basis for the collaborative curation of TCP corpora, especially
the correction of several million incompletely or incorrectly transcribed
words. In 2010 several undergraduates, supervised by Martin Mueller, fixed
about 20,000 errors in some 280 non-Shakespearean TCP plays, using
Annolex, a Web-based curation tool designed by Craig Berry. This summer
five undergraduates will work with Mueller on a similar project, called
"Shakespeare His Contemporaries," using an improved version of Annolex and
hoping to correct most of the wrongly or incompletely transcribed words in
some 600 plays printed before 1660.

The annotated ECCO files contain some known errors that will require
modifying the training data in subsequent releases. In particular,
upper-case occurrences of some words ('Case', 'Care') have been tagged as
verbs when they are in fact nouns. The inconsistent capitalization of
Early Modern English poses surprisingly difficult challenges.

Abbot and EEBO-MorphAdorner were funded by the Andrew W. Mellon
Foundation, as were the WordHoard and MONK projects in which much of the
preliminary work was done. Additional funding for EEBO-MorphAdorner was
provided by the Center for Library Initiatives at the CIC, the Ford
Center for Global Citizenship at Northwestern's Kellogg School of
Management, the Northwestern University Library, and ProQuest.




