[Humanist] 29.770 EEBO OCR'd

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Fri Mar 11 09:45:44 CET 2016

                 Humanist Discussion Group, Vol. 29, No. 770.
            Department of Digital Humanities, King's College London
                Submit to: humanist at lists.digitalhumanities.org

        Date: Thu, 10 Mar 2016 08:33:43 -0600
        From: Aaron McCollough <amccollo at illinois.edu>
        Subject: Re:  29.769 pubs: EEBO OCR'd, needing correction
        In-Reply-To: <f354e7d2052a4537b98ee288d7047eca at CHIHT1.ad.uillinois.edu>

FWIW, this is an invidious characterization of the EEBO-Text Creation
Partnership... a project without which 18thConnect would not have been
feasible, and one that has significantly facilitated the research of
early modern scholars for over 15 years.

Nothing against Laura and her project, but it shouldn't be necessary to
throw shade on needful work that came before in order to bring one's own
work into the light.

Aaron McCollough (formerly an EEBO-TCP outreach librarian)

On Thu, Mar 10, 2016 at 2:14 AM, Humanist Discussion Group <
willard.mccarty at mccarty.org.uk> wrote:

>                  Humanist Discussion Group, Vol. 29, No. 769.
>             Department of Digital Humanities, King's College London
>                        www.digitalhumanities.org/humanist
>                 Submit to: humanist at lists.digitalhumanities.org
>         Date: Wed, 9 Mar 2016 13:05:10 -0600
>         From: Laura Mandell <laura.mandell at gmail.com>
>         Subject: EEBO available in TypeWright
> *** Attachments:
> http://www.digitalhumanities.org/humanist/Attachments/1457550721_2016-03-09_humanist-owner@lists.digitalhumanities.org_21134.2.png
> EEBO in TypeWright
> We are pleased to announce that the Mellon-funded Early Modern OCR Project
> – eMOP – has completed running Optical Character Recognition Software on
> the 138,538 documents in ProQuest’s Early English Books Online (EEBO), and
> we are now making almost all of them available in 18thConnect.org for
> correcting the OCR. Some document images were too poor to run through the
> software, but we have loaded the resulting “dirty OCR” for 113,909
> documents into the TypeWright tool at 18thConnect.org for crowd-sourced
> correction (http://www.18thconnect.org/typewright/documents). We were able
> to get an excellent contract with both ProQuest and Gale for all the
> documents that are loaded into TypeWright, all of EEBO and
> Eighteenth-Century Collections Online (ECCO): any scholar or student who
> corrects a document gets to keep it to do whatever they wish with it,
> ideally create an online digital edition such as the one you can see here, created
> by an undergraduate student of Stephen Gregg’s:
>  http://ahymntothepillory.blogspot.co.uk/ .
> Once you have corrected a document, 18thConnect will send it to you in both
> plain-text and TEI-encoded formats. Additionally, the text will then be
> full-text searchable in both ProQuest’s and Gale’s EEBO and ECCO, and in
> 18thConnect.org. When you search the latter, 18thConnect gives search
> returns in the form of links to the texts in EEBO or ECCO, but, for those
> who use 18thConnect without subscriptions to those databases, we also
> provide information about holding libraries. Moreover, for those who DO
> subscribe to these catalogs, our research capacities will have been
> increased by working on the data we care about. Please note that these
> catalogs are being sold to libraries just as they are – in correcting the
> data, we are NOT increasing the profits of these companies, only our own
> research capacities (please see Mandell and Grumbach, “The Business of
> Digital Humanities: Capitalism and Enlightenment,” Scholarly and Research
> Communication 6.4 [2015]: http://src-online.ca/index.php/src/issue/current
> ).
> A word about search: although all of Gale’s ECCO is searchable by word, OCR
> errors diminish the number of results one gets. A forthcoming article by
> Mandell demonstrates that the error rate in searching for bigrams (two-word
> phrases) is 50 to 60%--that is, one is missing over half the results one
> might otherwise get.  In the case of EEBO, only those texts that have been
> typed by the Text Creation Partnership are searched by word when you are
> searching EEBO, as you can see on the EEBO search page, in the drop-down
> box describing what is searchable:
> [image: Inline image 1]
> We sincerely hope that professors and students can work together to make
> sure that these untranscribed and poorly transcribed documents (the 85,200
> documents so far not available to search as full text) do not become part
> of a “dark archive,” but can be fully searchable by future generations of
> scholars, both inside and outside the academy.
> You can access the EEBO documents at http://www.18thConnect.org, using the
> TypeWright tab, “Advanced Search,” or the Search Tab and selecting
> “TypeWright Enabled Documents”; in both cases, also select “EEBO” under
> “Other Collections.”
> In addition to the instructions for using TypeWright available on the site
> itself once you begin editing a document, we have an introductory video
> available:  http://www.18thconnect.org/about/typewright/#video. We also
> have a few short videos available on a playlist on YouTube that introduces
> TypeWright features one by one, and includes a video about editing EEBO
> texts specifically, which pose their own kinds of problems:
> http://bit.ly/TW-features.
> Also, feel free to contact us with questions or concerns at
> technologies at 18thConnect.org.
> --
> Laura Mandell
> Director, Initiative for Digital Humanities, Media, and Culture
> Professor, English
> Texas A&M University
> p: 979-845-8345
> e: idhmc at tamu.edu
> @mandellc
> http://idhmc.tamu.edu


*Aaron McCollough*
Scholarly Communications & Publishing Librarian
Head, Scholarly Communications & Publishing Unit
Asst. Professor, University Library
Office 450-Y
University of Illinois at Urbana-Champaign
Tel. 217-265-5390
