[Humanist] 22.444 texts on WordHoard

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Tue Jan 13 07:42:20 CET 2009

                 Humanist Discussion Group, Vol. 22, No. 444.
         Centre for Computing in the Humanities, King's College London
                Submit to: humanist at lists.digitalhumanities.org

        Date: Mon, 12 Jan 2009 14:50:05 -0600
        From: Martin Mueller <martinmueller at NORTHWESTERN.EDU>
        Subject: 334 texts of Early Modern English Drama available on WordHoard

Humanists whose institutions are subscribers to the Text Creation  
Partnership (TCP) may be interested in an experimental implementation  
of WordHoard that surrounds all of Shakespeare with ~300 other English  
plays from the TCP archive published between 1520 and 1660.  This is  
available at http://wordhoard.northwestern.edu.  We have enabled  
access to a group of CIC universities, but I know that there are many  
other subscribers on both sides of the Atlantic and beyond. We do not  
quite have a bullet-proof production service, but we would like to  
accommodate users. If you are in a subscribing institution and would  
like access please send me an email.  We will do our best to  
accommodate all requests that we can manage in the current environment.

What can you do with Early Modern English plays in WordHoard that you  
cannot do just as easily or better with Chadwyck-Healey's Literature  
Online, the Michigan interface, or Mark Olsen's Philologic?

'Remediation' is a useful term coined by Jay Bolter and Richard Grusin  
in their recent book of that title. It draws attention to the way in  
which changes in medium shape or inflect your encounter with an  
object. In WordHoard we tried to build on and improve philological
tools and procedures that make it easier to go from the word here to  
the words there. Such procedures have a history that reaches back to  
the biblical concordances of medieval monks or Jefferson's Lazy Susan  
bookstand (http://monticellostore.stores.yahoo.net/110000.html). Try  
looking at two different pages of the same book at the same time.  
WordHoard is to my knowledge the only digital tool that does a good  
job of displaying two arbitrarily chosen pages in the same field of
vision.

"Save the time of the reader" is Ranganathan's fourth law of library  
science. WordHoard's very flexible concordance features are a faithful  
and ingenious application of that law. Digital texts may be said to  
carry a built-in concordance with them. Where a search retrieves only  
a handful of hits, readers 'eyeball' them quickly and informally.  But  
as the list of hits grows, working through it becomes tedious very
soon. WordHoard will let you group, sort, and keep count of hits by
author, work, date, preceding or following word, part of speech,  
spelling, or lemma. The reshuffling of results is instant and makes it  
much easier to get an overview of the distribution of common words,  
such as 'honour' or 'think'.
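
The kind of regrouping described above can be sketched in a few lines
of Python. This is only an illustration of the idea, not WordHoard's
own code: the hit records, the field names, and the helper functions
count_by and sort_by are all invented for the example.

```python
from collections import Counter

# Invented sample hits for the lemma 'honour'. Every name and value here is
# a placeholder, not data taken from WordHoard or the TCP texts.
HITS = [
    {"author": "Author A", "work": "Play 1", "spelling": "honor", "next_word": "of"},
    {"author": "Author A", "work": "Play 1", "spelling": "honour", "next_word": "and"},
    {"author": "Author A", "work": "Play 2", "spelling": "honour", "next_word": "of"},
    {"author": "Author B", "work": "Play 3", "spelling": "honor", "next_word": "is"},
]


def count_by(hits, key):
    """Group hits by one attribute and count them, largest group first."""
    return Counter(hit[key] for hit in hits).most_common()


def sort_by(hits, *keys):
    """Return the hits reordered by one or more attributes."""
    return sorted(hits, key=lambda hit: tuple(hit[k] for k in keys))


if __name__ == "__main__":
    # The same hit list, regrouped by different attributes.
    for key in ("author", "work", "spelling", "next_word"):
        print(key, count_by(HITS, key))
```

The point of the sketch is that one result list supports many views:
the search runs once, and each regrouping is a cheap pass over the
hits already retrieved.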

WordHoard is typically slower than Philologic in retrieving an initial
result list, but the overall time cost for a given search will nearly
always be lower because WordHoard has much more powerful affordances
for postprocessing an initial result set.

The reshuffling prowess of WordHoard shows up in two specialized
features. There is a table of contents in which the plays can be
displayed by author, date, or genre (the date and genre assignments
are based on the Annals of English Drama by Harbage and Schoenbaum).
There is also a Lexicon of all lemmas used in Early Modern Drama. It  
gives you the document and collection frequencies for each lemma, that  
is to say the number of plays in which it occurs and the total count.  
You can sort and filter this lexicon in various ways. You can also cut  
and paste it into Excel, which lets you manipulate the data with even  
greater flexibility.
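
For readers who want to see the two frequencies side by side, here is
a minimal Python sketch. It assumes each play has already been reduced
to a list of lemmas; the toy corpus and the function names lexicon and
to_tsv are invented for illustration and say nothing about how
WordHoard stores its data.

```python
from collections import Counter

# Invented toy corpus: each play is represented as a list of lemmas.
PLAYS = {
    "Play 1": ["honour", "think", "honour", "king"],
    "Play 2": ["think", "king"],
    "Play 3": ["honour", "think", "think"],
}


def lexicon(plays):
    """Map each lemma to (document frequency, collection frequency)."""
    collection_freq = Counter()  # total occurrences across all plays
    document_freq = Counter()    # number of plays containing the lemma
    for lemmas in plays.values():
        collection_freq.update(lemmas)
        document_freq.update(set(lemmas))  # count each play at most once
    return {
        lemma: (document_freq[lemma], collection_freq[lemma])
        for lemma in collection_freq
    }


def to_tsv(lex):
    """Render the lexicon as tab-separated text, most frequent lemma first."""
    rows = sorted(lex.items(), key=lambda item: (-item[1][1], item[0]))
    lines = ["lemma\tplays\ttotal"]
    lines += [f"{lemma}\t{df}\t{cf}" for lemma, (df, cf) in rows]
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_tsv(lexicon(PLAYS)))
```

Tab-separated text of this kind pastes directly into a spreadsheet,
with one lemma per row.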

The texts for the WordHoard edition of Early Modern Drama come from  
the Text Creation Partnership -- the same texts that you consult in  
Philologic. But the texts have been morphosyntactically tagged and  
lemmatized. Thus a search for a lemma or modern dictionary entry form  
of a word will retrieve all orthographic and morphological variants of  
that word. You can search across the 334 plays as if they were written  
and spelled in modern English.

Several cautions are in order here. First, a word about the texts. The
transcriptions are full of gaps: letters or words that the
transcribers could not decipher. They cry out for user-contributed  
error corrections. Improving these texts over time will be a  
worthwhile task for the community of users.

Second, the process of 'linguistic annotation', to use the term of art  
for lemmatization and morphosyntactic tagging, is error prone. It is  
done automatically, using either rule-driven or probabilistic  
taggers.  Such taggers achieve accuracy rates of 97% when working with  
modern texts in standardized spelling.  The tagger used in WordHoard,  
Phil Burns' MorphAdorner, does a remarkably good job with Early Modern
English, but there are lots of errors. We have not yet had the time to  
review the plays.  Even in its current form the results will be useful  
for many purposes, but there is much room for improvement, and at some  
point in 2009 we will have a release that will fix quite a few  
errors.  The Shakespeare data are cleaner because they have been  
checked for errors on numerous occasions. On the other hand, the  
Shakespeare data have a different history -- the texts do not come  
from the TCP collection -- and there are some residual inconsistencies  
in tagging and lemmatization.

Third, the display of acts, scenes, prefaces, prologues, and the like  
is governed by the structure of the SGML source files from the Text  
Creation Partnership. What with the variable practices of early modern  
printers and the no less variable practices of contemporary encoders,  
there are a lot of odd features. Some of them show up only when you
display them in an environment that is as elegantly and consistently
designed as John Norstad's 'digital page.' Here, too, there is room  
for improvement, and I hope that within months we will have an updated
version in which the encoding practices are standardized so that
variance reflects more faithfully the practices of early modern
printers.

WordHoard has two statistical features that help with J. R. Firth's
dictum and advice "you shall know a word by the company it keeps":  
collocation analysis and Dunning's log likelihood test. The former  
lets you identify other words that are characteristically associated  
with a given word in an author. The difference between the 'associates'
of 'honour' in Chaucer and Shakespeare, for instance, is very striking.
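
In its simplest form, collocation analysis counts the words that occur
within a few positions of a given word. The Python sketch below shows
only that counting step, on an invented token list. WordHoard's own
collocation feature also applies statistical measures of association,
which are not reproduced here; the function name collocates and the
default span of two words are assumptions made for the example.

```python
from collections import Counter


def collocates(tokens, node, span=2):
    """Count the words found within `span` positions of each `node` token."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            counts.update(left + right)
    return counts


if __name__ == "__main__":
    # Invented token sequence, used only to exercise the function.
    tokens = "the honour is great and his honour is lost".split()
    print(collocates(tokens, "honour").most_common())
```

Raw counts of this kind favour very common words, which is why a
serious collocation analysis weighs them against the overall
frequency of each neighbour.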

Dunning's log likelihood statistic is an excellent tool for  
identifying words that are disproportionately common or rare in one  
set of texts when compared with another. You can construct your own  
sets for that purpose, but there are a number of prefabricated work  
sets, including not only the entire corpus of Early Modern Drama, but  
the works of each author, and aggregates by genre, such as 'comedy',
'history', and 'tragedy'.  WordHoard now has a 'tag cloud' feature that
uses font size and color (black/grey) to visualize the results of
a log likelihood test. Phil Burns is responsible for the ingenious  
version of the tag cloud, which lets you remove statistical outliers  
and focus on the middle as well as the top of a range.
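
For the curious, the log likelihood comparison can be sketched in
Python. This is a commonly used two-term form of the statistic for
comparing one word across two sets of texts; I assume it here for
illustration, and WordHoard's exact computation may differ in detail.
The function name and the parameter names are invented for the
example.

```python
import math


def log_likelihood(a, b, c, d):
    """Two-term log likelihood (G2) for one word in two sets of texts.

    a: occurrences of the word in the analysis set
    b: occurrences of the word in the reference set
    c: total number of words in the analysis set
    d: total number of words in the reference set
    """
    # Expected counts if the word were equally common in both sets.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:  # a zero count contributes nothing to the sum
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2


def overused(a, b, c, d):
    """True if the word is relatively more common in the analysis set."""
    return a / c > b / d
```

A value near zero means the word is about equally common in both
sets; the larger the value, the stronger the evidence of a difference.
The statistic itself has no sign, so the direction (overuse or
underuse) has to be read from the relative frequencies, as in the
second function.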

To sum up: for any inquiry that benefits from close attention to
verbal detail, the remediation of a very substantial proportion of
Early Modern English drama creates a digital environment with  
affordances that are not matched elsewhere. There are obvious ways in  
which the texts and the interface can benefit from further work. If  
enough users care about it, those improvements move into the range of
the possible.

Martin Mueller
Professor of English and Classics
Northwestern University
