[Humanist] 29.134 use of n-grams

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Wed Jul 1 23:30:35 CEST 2015

                 Humanist Discussion Group, Vol. 29, No. 134.
            Department of Digital Humanities, King's College London
                Submit to: humanist at lists.digitalhumanities.org

        Date: Wed, 1 Jul 2015 01:23:25 +0000
        From: Mark Davies <Mark_Davies at byu.edu>
        Subject: Re:  29.127 use of n-grams
        In-Reply-To: <20150629213032.B525B2C84 at digitalhumanities.org>

Sayan Bhattacharyya wrote:
> I would like to mention that some of the researchers (from the Culturomics
> Lab) who were involved in creating the Google N-gram Viewer that Andrew
> refers to above are currently collaborating with us at the HathiTrust
> Research Center on an NEH-funded project called the HathiTrust+Bookworm
> project.

For those who are interested in the Google Books n-grams, I might suggest:

-- http://googlebooks.byu.edu
This interface uses the same n-grams dataset as the "standard interface", but it allows much more powerful searching: finding collocates (to examine cultural shifts in much more meaningful ways than the simple Culturomics approach), comparing the frequency of all words by time period, more powerful part-of-speech searching and lemmatization, integration with semantic resources, etc.
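To make the collocate idea concrete: a collocate search counts the words that occur near a target word, rather than looking the target up as an isolated string. The sketch below is not the BYU interface's actual implementation, just a minimal illustration (with invented example sentences) of what a window-based collocate count computes.

```python
from collections import Counter

def collocates(sentences, target, window=4):
    """Count words within +/-window tokens of each occurrence of target.

    A toy version of a collocate query; real corpus interfaces also
    handle lemmas, part of speech, and time periods.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok == target:
                lo = max(0, i - window)
                hi = min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        counts[tokens[j]] += 1
    return counts

# Hypothetical mini-corpus, purely for illustration
docs = [
    "the engine roared to life",
    "the steam engine changed industry",
    "a new engine of growth",
]
print(collocates(docs, "engine").most_common(3))
```

Run over dated subcorpora (one list of sentences per decade, say), the same count reveals how a word's typical neighbours shift over time, which is the "cultural shifts" use mentioned above.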

For a quick overview, see: http://googlebooks.byu.edu/compare-googleBooks.asp. For much more detail:

Davies, Mark. (2014) “Making Google Books n-grams useful for a wide range of research on language change”. International Journal of Corpus Linguistics 19 (3): 401-16.


Of course these are just n-grams (1- to 5-word strings; no other searchable context). And as most people are aware, the dataset only includes n-grams that occur 40 times or more, which probably eliminates 90-95% of all *types* (not tokens) for 3-, 4-, and 5-grams.
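The type/token distinction here matters because n-gram frequencies are Zipf-like: a few high-frequency strings account for most tokens, while the long tail of rare strings accounts for most types. A small sketch with invented counts (not the real Google Books figures) shows why a 40-occurrence cutoff can discard the great majority of types while keeping nearly all tokens:

```python
from collections import Counter

# Hypothetical trigram counts: a few frequent types, many rare ones
counts = Counter({
    "one of the": 500, "as well as": 120, "part of the": 45,
    "dusty old ledger": 2, "her violin case": 1,
    "quiet harbor town": 1, "a brass key": 3,
})

# Apply the frequency threshold used by the Google Books n-grams
kept = {g: c for g, c in counts.items() if c >= 40}

type_retention = len(kept) / len(counts)            # share of distinct strings kept
token_retention = sum(kept.values()) / sum(counts.values())  # share of occurrences kept
print(f"types kept: {type_retention:.0%}, tokens kept: {token_retention:.0%}")
# -> types kept: 43%, tokens kept: 99%
```

Scaled up to billions of trigrams, the same skew is what drives the 90-95% type loss Davies mentions, even though total token coverage remains high.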

The largest "structured" historical corpus (actual sentences, paragraphs, etc -- not just n-grams) is the Corpus of Historical American English (COHA):


The n-grams from this corpus are freely available:


In addition, it is possible to get the full 400 million word corpus:



Mark Davies
Professor of Linguistics / Brigham Young University

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
