[Humanist] 27.783 Perseus and Leipzig corpora

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Mon Feb 10 07:03:42 CET 2014

                 Humanist Discussion Group, Vol. 27, No. 783.
            Department of Digital Humanities, King's College London
                Submit to: humanist at lists.digitalhumanities.org

        Date: Sat, 8 Feb 2014 17:15:00 -0500
        From: Gregory Crane <gregory.crane at TUFTS.EDU>
        Subject: the Perseus Corpus and the Leipzig Corpus of Open Greek and Latin

Dear List Members,

I have prepared a brief description of two projects: (1) the Perseus 
Corpus of Open Greek and Latin -- essentially a revision of the source 
texts in Perseus (and including a number not yet available on the 
Perseus site) and (2) the Leipzig Corpus of Open Greek and Latin, which 
will build upon, but should be much larger than, the initial Perseus 
Corpus. Downloading the Leipzig Corpus will get you the Perseus Corpus 
as well.

We have resources in hand to begin the Leipzig Corpus, and new materials 
should start to appear in 2014 and early 2015. Our hope, however, is to 
add another 300 million words of Greek and Latin TEI EpiDoc XML, and we 
are submitting a proposal to the German Science Foundation (DFG) 
Digitization and Cataloging Program. I am looking for feedback. The DFG 
program requires matching funds -- for every 1 euro you match, DFG can 
give you 2 euros. Because of the Humboldt Chair, we can make the match, 
but we need to get started if we want a chance at a second possible 
3-year cycle beginning during the five-year Humboldt startup funding. We 
hope to submit the proposal by March 1.

You can find a fuller description below:


The current working abstract for the proposal follows:

The Open Philology Project proposes to use public domain editions 
(including, under German law, editions published as late as 1991 when the 
project concludes in 2017) as the foundation for the Leipzig Corpus of 
Open Greek and Latin, available under a CC-BY-SA license, including both 
Classical and Byzantine Greek as well as Latin works produced both 
during and after Classical antiquity. Our goal is to provide 
comprehensive coverage of surviving Greek and Latin sources composed 
through 600 CE, to begin providing multiple editions for many works 
aligned with one another, and to provide a solid foundation for 
Byzantine Greek and the massive body of post-classical Latin. Building 
directly upon more than 25 years of continuous research and development 
by the Perseus Digital Library, upon recent breakthrough work on OCR for 
Classical Greek, upon scanned books available from mass digitization 
projects and upon preliminary work begun at Leipzig in May 2013, the 
Leipzig Corpus contains three components: (1) c. 400 million words of 
corrected OCR source texts, with FRBR-work based metadata, including 
at least one edition of every major extant work produced through 600 CE as 
well as substantial initial coverage of post-classical materials, 
including corrected transcriptions of the textual notes and Text 
Encoding Initiative (TEI) XML encoding that captures at least one 
established citation scheme; (2) automatically generated metadata for 
all texts and curated metadata for as much of the collection as possible 
(including lemmatization and morpho-syntactic analysis, classification 
and identification of named entities, identification of text reuse).
