[Humanist] 25.15 new on WWW: 155 billion American English words

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Fri May 13 07:16:04 CEST 2011

                  Humanist Discussion Group, Vol. 25, No. 15.
         Centre for Computing in the Humanities, King's College London
                Submit to: humanist at lists.digitalhumanities.org

        Date: Thu, 12 May 2011 17:11:06 -0600
        From: Mark Davies <Mark_Davies at byu.edu>
        Subject: 155 *billion* (155,000,000,000) word corpus of American English

(Apologies for cross-postings)

We’re pleased to announce a new corpus -- the Google Books (American English) corpus: http://googlebooks.byu.edu/.

This corpus is based on the American English portion of the Google Books data (see http://ngrams.googlelabs.com and especially http://ngrams.googlelabs.com/datasets). It contains 155 *billion* words (155,000,000,000) in more than 1.3 million books from the 1810s-2000s (including 62 billion words from just 1980-2009).

The corpus has most of the functionality of the other corpora from http://corpus.byu.edu (e.g. COCA, COHA, and our interface to the BNC), including: searching by part of speech, wildcards, and lemma (and thus advanced syntactic searches), synonyms, collocate searches, frequency by decade (tables listing each individual string, or charts for total frequency), comparisons of two historical periods (e.g. collocates of "women" or "music" in the 1800s and the 1900s), and more.
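
For readers who want to see what "frequency by decade" amounts to, here is a minimal sketch in Python. It is not the corpus interface's actual implementation. It assumes you already have per-year counts for one word plus per-year corpus totals, roughly the kind of information the n-gram datasets provide; the function names, the input layout, and the toy numbers are hypothetical and for illustration only.

```python
from collections import defaultdict


def decade_of(year):
    """Map a year to the first year of its decade, e.g. 1987 -> 1980."""
    return (year // 10) * 10


def per_million_by_decade(word_counts, totals_by_year):
    """Return {decade: occurrences per million words} for one word.

    word_counts    -- iterable of (year, count) pairs for the word (assumed layout)
    totals_by_year -- dict mapping year -> total words in the corpus that year
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, count in word_counts:
        hits[decade_of(year)] += count
    for year, total in totals_by_year.items():
        totals[decade_of(year)] += total
    # Normalize so that decades of very different sizes can be compared.
    return {
        decade: hits.get(decade, 0) * 1_000_000 / totals[decade]
        for decade in sorted(totals)
        if totals[decade] > 0
    }


# Toy numbers, invented for illustration (not real corpus data).
toy_counts = [(1901, 5), (1909, 15), (1910, 10)]
toy_totals = {1901: 1_000_000, 1909: 1_000_000, 1910: 500_000}
print(per_million_by_decade(toy_counts, toy_totals))
```

Normalizing to a per-million rate matters here because the later decades contain far more text than the earlier ones, so raw counts alone would overstate recent usage.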

This American English corpus is just one of seven Google Books-based corpora that we hope to create in the next year or two (contingent on funding, which we are applying for in June 2011). If funded, the other corpora will include British English, English from the 1500s-1700s, and corpora of Spanish, French, and German (see the listing at http://ngrams.googlelabs.com/datasets). Each of these corpora will be based on at least 50 billion words of data, and they should represent a nice addition to existing resources.

The Google Books (American English) corpus is freely available at http://googlebooks.byu.edu, and we hope that it is of value to you in your research and teaching.


Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
Web: http://davies-linguistics.byu.edu
** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **