[Humanist] 24.90 humanities go Google

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Sun Jun 6 10:27:06 CEST 2010

                  Humanist Discussion Group, Vol. 24, No. 90.
         Centre for Computing in the Humanities, King's College London
                Submit to: humanist at lists.digitalhumanities.org

        Date: Sat, 5 Jun 2010 05:51:00 -0600
        From: Mark Davies <Mark_Davies at byu.edu>
        Subject: RE: [Humanist] 24.86 The Humanities Go Google
        In-Reply-To: <20100605080738.A570E5A4A7 at woodward.joyent.us>


The Humanities Go Google 
By Marc Parry
Palo Alto, Calif.
>> Matthew L. Jockers may be the first English professor to assign 1,200 novels in one class...

No slight in the least to Matt, who I know is doing great work with DH. But this entire "Chronicle of Higher Education" article seems a bit strange to me (but maybe it's just because the majority of articles from the Chronicle seem strange / misguided to me in general :-).

The article makes it sound like the students at Stanford are the first to use hundreds of millions of words in text archives or corpora to look at language change and variation. Probably a surprise to the Chronicle, but those of us in corpus linguistics have been doing this for 20-25 years. Two quick examples:

-- The recently released alpha version of the 400 million word, NEH-funded Corpus of *Historical* American English (COHA; http://corpus.byu.edu/coha) has already begun to be used in ways that are at least as developed as the work at Stanford. For example, this past semester, the students in an undergraduate "capstone" course in English used the corpus to look at a wide range of linguistic shifts in American English during the past 200 years, including lexical, morphological, syntactic, and semantic change, relationship of lexical change to historical and cultural shifts, etc. The 200+ projects created by these students are online at the corpus website. In this case, they are using *140,000* texts (not just 1,200) from a wide range of genres (not just fiction) from the 1810s-2000s.

-- The Corpus of Contemporary American English (COCA; http://www.americancorpus.org) is composed of more than 400 million words in 160,000+ texts from 1990-2009, and it is used by about 55,000 *unique* users each month, many of them linguists and many in literary studies. Since it was released in 2008, it has been used as the basis for more than a hundred academic papers, journal articles, theses, etc. In addition, it is the only tool that allows researchers to look in-depth at ongoing change in a wide range of genres, dealing with many different types of change (lexical, morphological, syntactic, semantic) -- see the LLC article to appear in 1-2 months.

The Google Books-based work at Stanford is exciting. However, because of the simplistic architecture and query interface used for typical Google-like queries, it cannot do (at all, or easily) a number of types of searches that can be done in 1-2 seconds with the architecture of a structured corpus like COCA or COHA:

-- find the frequency of a word, morpheme, syntactic construction, or collocates (for word meaning), decade by decade or year by year. Google News archive and Google books *can* show the frequency over time, but in far too many cases, the book/article is not really *from* that year, but rather just *refers to* that year in the book or article, so the frequency data is useless.

-- search by substring, to find variation and change with word roots, suffixes, etc (e.g. the frequency of all adjectives with the suffix *ble, decade by decade during the last 200 years)

-- search by grammatical tag, to do syntax (prescriptive or descriptive) (e.g. the rise of "going to V", who/whom in particular contexts, changes in relative pronouns, etc.)

-- search by collocates, to see semantic change (e.g. using collocates to see new meanings or uses for words like engine, gay, green, or terrific)

-- use the integrated thesaurus and customized lists to look at semantically-driven change (e.g. all phrases related to a "family member" (mother, sister, etc) talking in a particular way (synonyms of a given verb) to someone else in the family). With a Google-like approach, you are typically looking just at *exact strings* of words.

-- limit and order the results by frequency in a given set of decades, and compare these (e.g. adjectives near "woman" in the 1880s-1920s compared to the 1960s-2000s, or which of the 20-30 synonyms of [beautiful] were much more common in the 1800s than in the 1900s, with a single one-second search)
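
To make the substring and grammatical-tag searches above concrete, here is a minimal Python sketch. It is not how COCA or COHA are implemented; the toy corpus, the tag names, and the function `count_by_decade` are all made up for illustration. The only point is that once every text carries a reliable date and every token a part-of-speech tag, "adjectives in *ble, decade by decade" becomes a simple grouped count.

```python
from collections import Counter

# Hypothetical toy corpus: (publication year, [(word, pos_tag), ...]).
# A real corpus would hold millions of tagged tokens; the tags here are invented.
TEXTS = [
    (1815, [("a", "DET"), ("terrible", "ADJ"), ("storm", "NOUN")]),
    (1818, [("an", "DET"), ("honourable", "ADJ"), ("man", "NOUN")]),
    (1903, [("the", "DET"), ("table", "NOUN"), ("was", "VERB"), ("portable", "ADJ")]),
    (1907, [("a", "DET"), ("remarkable", "ADJ"), ("and", "CONJ"),
            ("notable", "ADJ"), ("day", "NOUN")]),
]


def count_by_decade(texts, suffix, pos):
    """Return {decade: (matching tokens, all tokens)} for words that
    end in `suffix` and carry the part-of-speech tag `pos`."""
    hits = Counter()
    totals = Counter()
    for year, tokens in texts:
        decade = year // 10 * 10          # 1815 -> 1810
        totals[decade] += len(tokens)
        for word, tag in tokens:
            if tag == pos and word.lower().endswith(suffix):
                hits[decade] += 1
    return {d: (hits[d], totals[d]) for d in sorted(totals)}


counts = count_by_decade(TEXTS, "ble", "ADJ")
for decade, (matched, total) in counts.items():
    # Normalize to a per-million rate so decades of different size compare fairly.
    print(f"{decade}s: {matched} of {total} tokens "
          f"({matched * 1_000_000 / total:,.0f} per million)")
```

Note that the noun "table" is not counted, because the tag filter excludes it; an exact-string search has no way of making that distinction.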

Again, all of these are doable in 1-2 seconds with a full-featured corpus architecture and interface like that of COCA or COHA, but they would be difficult or impossible with a simplistic Google-like architecture.
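
As a rough illustration of the collocate comparison across periods, the following Python sketch counts the words that occur near a node word in two date ranges and lists those found only in the later one. The sample sentences, the window size, and the function `collocates` are assumptions for the sake of the example, not part of the COCA/COHA interface, and no stopword filtering or statistical scoring is attempted.

```python
from collections import Counter

# Hypothetical toy data: (publication year, list of lowercase tokens).
TEXTS = [
    (1885, "the gay party was a merry and gay affair".split()),
    (1895, "a gay and cheerful song".split()),
    (1985, "the gay rights movement grew".split()),
    (1995, "gay rights and gay community groups".split()),
]


def collocates(texts, node, start, end, window=2):
    """Count words within `window` tokens of `node` in texts dated
    between `start` and `end` (inclusive)."""
    counts = Counter()
    for year, tokens in texts:
        if not start <= year <= end:
            continue
        for i, tok in enumerate(tokens):
            if tok != node:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                # Skip the node itself, and other occurrences of it.
                if j != i and tokens[j] != node:
                    counts[tokens[j]] += 1
    return counts


early = collocates(TEXTS, "gay", 1880, 1929)
late = collocates(TEXTS, "gay", 1960, 2009)

# Collocates attested only in the later period hint at a new sense or use.
only_late = sorted(set(late) - set(early))
print(only_late)
```

A real comparison would rank collocates by an association measure rather than raw presence, but the grouping by period is the part that depends on having dated, structured texts.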

So while the work by Matt and colleagues is in fact quite impressive, it would have been nice if the Chronicle had done at least the minimum in terms of research to see that many, many others have already been doing similar research for a long time now.

Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
Web: http://davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
