[Humanist] 22.540 a control corpus of 19th century American intellectual writing?

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Tue Feb 17 07:44:47 CET 2009


                 Humanist Discussion Group, Vol. 22, No. 540.
         Centre for Computing in the Humanities, King's College London
                       www.digitalhumanities.org/humanist
                Submit to: humanist at lists.digitalhumanities.org



        From: Humanist Discussion Group <willard.mccarty at mccarty.org.uk>



  [1]   From:    Martin Mueller <martinmueller at northwestern.edu>           (61)
        Subject: Re: [Humanist] 22.535 a control corpus of 19th century
                American intellectual writing?

  [2]   From:    Mark Davies <Mark_Davies at byu.edu>                         (23)
        Subject: RE: [Humanist] 22.535 a control corpus of 19th century
                American intellectual writing?


--[1]------------------------------------------------------------------------
        Date: Mon, 16 Feb 2009 07:44:35 -0600
        From: Martin Mueller <martinmueller at northwestern.edu>
        Subject: Re: [Humanist] 22.535 a control corpus of 19th century American intellectual writing?
        In-Reply-To: <20090216120717.40E3D2DB4B at woodward.joyent.us>

John,

This is not a direct answer to your query but some of it may be  
relevant anyhow.

In the Monk Project we have developed procedures for taking texts from
various TEI archives and moving them into a common TEI P5 environment
that makes them more interoperable and supports easy linguistic
annotation.  We have used MorphAdorner, developed by Phil Burns at
Northwestern University, to produce texts with lemmatization,
virtual orthographic standardization, and part-of-speech tagging.
Brian Pytlik Zillig and Steve Ramsay at Nebraska have been responsible  
for the architecture and details of this process.

Some 300 texts from the Wright Archive of American fiction are
available in a linguistically annotated format right now. Another 700
Wright novels can be processed the same way. A further 2,000 didn't go
through the last editorial review when Perry Willett did the project,
but if I understand him correctly, a needed text can be brought up to
snuff within a couple of hours.

We also have ~100 earlier American texts from the public archive of  
the University of Virginia Early American fiction project.

The very large and diverse archive of 'Documenting the American South'  
at the University of North Carolina has practiced sparse but  
consistent annotation over the years.  DocSouth texts would yield to
linguistic annotation with little or no modification.

Texts in these TEI archives all have much better bibliographical  
information than what is typically found in Project Gutenberg texts.

MM
On Feb 16, 2009, at 6:07 AM, Humanist Discussion Group wrote:

>                Humanist Discussion Group, Vol. 22, No. 535.
>        Centre for Computing in the Humanities, King's College London
>                      www.digitalhumanities.org/humanist
>               Submit to: humanist at lists.digitalhumanities.org
>
>
>
>       Date: Mon, 16 Feb 2009 10:57:55 +0000
>       From: "Bradley, John" <john.bradley at kcl.ac.uk>
>       In-Reply-To: <20090216064439.639862E171 at woodward.joyent.us>
>
> I am supervising a student at CCH/KCL who is working with the  
> writings of several American scholars from the 19th century.  At  
> this point his work would benefit from having a control corpus of  
> 19th century American intellectual writing that he could use for  
> various kinds of statistical comparison.  He would welcome both  
> literary-oriented and non-literary (scientific?) texts, and even  
> material that although written for an intellectual audience appeared  
> in the non-scholarly press.
>
> He is checking out what is available in Project Gutenberg already.  
> Is there a member of Humanist who could suggest other possible  
> digital textual sources?
>
> Many thanks for your suggestions.
>
> ... john bradley
>


--[2]------------------------------------------------------------------------
        Date: Mon, 16 Feb 2009 10:49:53 -0700
        From: Mark Davies <Mark_Davies at byu.edu>
        Subject: RE: [Humanist] 22.535 a control corpus of 19th century American intellectual writing?
        In-Reply-To: <20090216120717.40E3D2DB4B at woodward.joyent.us>


You might try the various "Making of America" collections:

http://moa.umdl.umich.edu/
http://cdl.library.cornell.edu/moa/

as well as some of the other collections from:

http://memory.loc.gov/ammem/index.html

Also, I know it's later than the 1800s, but you might look at the 100 million word TIME Corpus (1920s-present): http://corpus.byu.edu/time/ .

Finally, I'm working on a 300 million word corpus of historical American English (early 1800s to the present), which will be balanced between fiction, non-fiction, newspapers, and popular magazines. It will complement the nearly 400 million word Corpus of Contemporary American English:

http://www.americancorpus.org

But this historical corpus is dependent on funding, and isn't available yet.

Best,

Mark Davies

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
Web: davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================






