[Humanist] 29.769 pubs: EEBO OCR'd, needing correction

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Thu Mar 10 09:14:54 CET 2016

                 Humanist Discussion Group, Vol. 29, No. 769.
            Department of Digital Humanities, King's College London
                Submit to: humanist at lists.digitalhumanities.org

        Date: Wed, 9 Mar 2016 13:05:10 -0600
        From: Laura Mandell <laura.mandell at gmail.com>
        Subject: EEBO available in TypeWright

EEBO in TypeWright

We are pleased to announce that the Mellon-funded Early Modern OCR Project
– eMOP – has completed running Optical Character Recognition Software on
the 138,538 documents in ProQuest’s Early English Books Online (EEBO), and
we are now making almost all of them available in 18thConnect.org for
correcting the OCR. Some document images were too poor to run through the
software, but we have loaded the resulting “dirty OCR” for 113,909
documents into the TypeWright tool at 18thConnect.org for crowd-sourced
correction (http://www.18thconnect.org/typewright/documents). We were able
to get an excellent contract with both ProQuest and Gale for all the
documents that are loaded into TypeWright, all of EEBO and
Eighteenth-Century Collections Online (ECCO): any scholar or student who
corrects a document gets to keep it and do whatever they wish with it,
ideally creating an online digital edition such as this one, created by an
undergraduate student of Stephen Gregg’s:
 http://ahymntothepillory.blogspot.co.uk/ .

Once you have corrected a document, 18thConnect will send it to you in both
plain-text and TEI-encoded formats. Additionally, the corrected text will
then be full-text searchable in ProQuest’s EEBO, in Gale’s ECCO, and in
18thConnect.org. When you search the latter, 18thConnect gives search
returns in the form of links to the texts in EEBO or ECCO, but, for those
who use 18thConnect without subscriptions to those databases, we also
provide information about holding libraries. Moreover, for those who DO
subscribe to these catalogs, our research capacities will have been
increased by working on the data we care about. Please note that these
catalogs are being sold to libraries just as they are – in correcting the
data, we are NOT increasing the profits of these companies, only our own
research capacities (please see Mandell and Grumbach, “The Business of
Digital Humanities: Capitalism and Enlightenment,” Scholarly and Research
Communication 6.4 [2015]: http://src-online.ca/index.php/src/issue/current).

A word about search: although all of Gale’s ECCO is searchable by word, OCR
errors diminish the number of results one gets. A forthcoming article by
Mandell demonstrates that the error rate in searching for bigrams (two-word
phrases) is 50 to 60%--that is, one is missing over half the results one
might otherwise get.  In the case of EEBO, only those texts that have been
typed by the Text Creation Partnership are searched by word when you are
searching EEBO, as you can see on the EEBO search page, in the drop-down
box describing what is searchable:

[image: Inline image 1]
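
To give a rough sense of why OCR errors cost so many phrase-search results,
here is a minimal sketch. It assumes, as a simplification, that OCR errors
fall on each word independently, so a two-word phrase is found only when
both words were recognized correctly. The function name and the accuracy
figures are illustrative assumptions and are not taken from the forthcoming
article.

```python
def phrase_recall(word_accuracy: float, n_words: int = 2) -> float:
    """Chance that every word of an n-word phrase survived OCR intact.

    Assumes errors hit each word independently (a simplification).
    """
    if not 0.0 <= word_accuracy <= 1.0:
        raise ValueError("word_accuracy must be between 0 and 1")
    return word_accuracy ** n_words


# Hypothetical per-word accuracies, chosen only for illustration.
for acc in (0.90, 0.70, 0.65):
    missed = 1.0 - phrase_recall(acc)
    print(f"word accuracy {acc:.0%}: about {missed:.0%} of bigram hits missed")
```

Under that assumption, a per-word accuracy of around 65 to 70% would already
mean missing more than half of the bigram matches, which is the order of
magnitude described above.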

We sincerely hope that professors and students can work together to make
sure that these untranscribed and poorly transcribed documents (the 85,200
documents so far not available to search as full text) do not become part
of a “dark archive,” but can be fully searchable by future generations of
scholars, both inside and outside the academy.

You can access the EEBO documents at http://www.18thConnect.org, using the
TypeWright tab, “Advanced Search,” or the Search Tab and selecting
“TypeWright Enabled Documents”; in both cases, also select “EEBO” under
“Other Collections.”

In addition to the instructions for using TypeWright available on the site
itself once you begin editing a document, we have an introductory video
available: http://www.18thconnect.org/about/typewright/#video. We also
have a few short videos available on a playlist on YouTube that introduces
TypeWright features one by one, and includes a video about editing EEBO
texts specifically, which pose their own kinds of problems:

Also, feel free to contact us with questions or concerns at
technologies at 18thConnect.org.

Laura Mandell
Director, Initiative for Digital Humanities, Media, and Culture
Professor, English
Texas A&M University
p: 979-845-8345
e: idhmc at tamu.edu
