[Humanist] 29.127 use of n-grams

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Mon Jun 29 23:30:32 CEST 2015

                 Humanist Discussion Group, Vol. 29, No. 127.
            Department of Digital Humanities, King's College London
                Submit to: humanist at lists.digitalhumanities.org

        Date: Tue, 30 Jun 2015 01:54:34 +1000
        From: Sayan Bhattacharyya <bhattach at umich.edu>
        Subject: uses of n-grams?

> Date: Fri, 12 Jun 2015 12:00:53 +0000
> From: Andrew Prescott <Andrew.Prescott at glasgow.ac.uk>
> Subject: Re: 29.94 uses of n-grams?

> Andreas Jucker, Irma Taavitsainen and Gerold Schneider, "Semantic corpus
> trawling: Expressions of “courtesy” and “politeness” in the Helsinki
> Corpus", Varieng: Studies in Variation, Contacts and Change in English 11
> (2012), available at
> http://www.helsinki.fi/varieng/series/volumes/11/jucker_taavitsainen_schneider/
> makes use of Google N-gram in studying chronological shifts in cultural
> constructions of politeness. However, the addendum to the article reveals
> some of the hazards of using the Google N-Gram viewer. It was found that
> some of the shifts in word use indicated by Google N-Gram were due to the
> decline of the use of the long ‘s’ (ſ), which OCR tends to misread as
> ‘f’, and were thus typographical artefacts rather than cultural changes.
> When an attempt was made to recalculate the
> results, it was found that Google had changed its algorithm, so that the
> original results could not be repeated.

> Frederick W. Gibbs and Daniel J. Cohen, "A Conversation with Data:
> Prospecting Victorian Words and Ideas", Victorian Studies 54.1 (Autumn
> 2011): 69-77, 185.

Sorry for responding a little late. In this connection, I would like to
mention that some of the researchers from the Culturomics Lab who were
involved in creating the Google N-gram Viewer that Andrew refers to above
are currently collaborating with us at the HathiTrust Research Center on an
NEH-funded project called the HathiTrust+Bookworm project.

This project is intended to plot lexical trends against the HathiTrust
corpus, which is quite large (currently about 14 million volumes of
digitized text), although our current prototype is set up to run against
only the pre-1923 volumes, about 4 million in all. The nice thing is
that, since the HathiTrust corpus comes with quite substantial
bibliographic metadata for most volumes, our project leverages that
metadata for faceted search, so that lexical trends can be plotted within
fairly well-focused subsets of the collection, as defined by the metadata
criteria the user specifies.
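
To make the idea of plotting a trend within a metadata-defined subset a
little more concrete, here is a minimal sketch in Python. It is not the
Bookworm code or its API: the record layout, the field names ("year",
"language", "genre") and the function names are all assumptions made up for
this illustration, and the data are toy values.

```python
from collections import defaultdict

# Hypothetical toy records (not real HathiTrust data): each volume carries
# some bibliographic metadata plus per-word counts.
VOLUMES = [
    {"year": 1850, "language": "eng", "genre": "fiction",
     "counts": {"courtesy": 3, "the": 97}},
    {"year": 1850, "language": "eng", "genre": "poetry",
     "counts": {"courtesy": 1, "the": 49}},
    {"year": 1900, "language": "eng", "genre": "fiction",
     "counts": {"courtesy": 1, "the": 199}},
    {"year": 1900, "language": "fre", "genre": "fiction",
     "counts": {"le": 100}},
]


def matches(volume, facets):
    """True if the volume's metadata equals every requested facet value."""
    return all(volume.get(key) == value for key, value in facets.items())


def yearly_relative_frequency(volumes, word, **facets):
    """Relative frequency of `word` per year, restricted to matching volumes."""
    hits = defaultdict(int)    # occurrences of the word, per year
    totals = defaultdict(int)  # all word occurrences, per year
    for vol in volumes:
        if not matches(vol, facets):
            continue
        hits[vol["year"]] += vol["counts"].get(word, 0)
        totals[vol["year"]] += sum(vol["counts"].values())
    # Normalise by the size of the subset so that years are comparable.
    return {y: hits[y] / totals[y] for y in sorted(totals) if totals[y]}


trend = yearly_relative_frequency(
    VOLUMES, "courtesy", language="eng", genre="fiction"
)
```

The point of the sketch is only that the denominator changes with the
facets: the same word gets a different curve depending on which subset of
the collection it is normalised against.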

In the near future (possibly in a few months), we expect to have
implemented the capability to plot the graph against specific, custom
"worksets" that the user can carefully curate (using metadata criteria and
optionally culling the returned results by hand). This will allow for
n-gram analysis (for literary purposes) at grain sizes as small as a
single volume, as large as the entire HathiTrust corpus, and everything
in between. The paper by Gibbs and Cohen mentioned by Andrew above actually
served as an inspiration for the project: to some degree, the project is
an attempt to fulfill the desiderata that Dan Cohen mentions in that paper
as worth having.

A more detailed explanation, including a link to our current prototype, can
be found on the HathiTrust+Bookworm project blog,
https://htrcbookworm.wordpress.com . The current prototype works with
individual words, but this will be extended to n-grams with somewhat higher
values of n in the near future.
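
For readers less familiar with the term, the difference between counting
individual words (n = 1) and counting n-grams with higher n can be sketched
as follows. This is only an illustration using the Python standard
library; the function name and the deliberately crude tokenizer are
assumptions of the sketch, not a description of how the prototype works.

```python
import re
from collections import Counter


def ngram_counts(text, n):
    """Count the n-grams (runs of n consecutive words) in `text`."""
    # Crude tokenizer for the sketch: lowercase runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Slide a window of width n over the tokens; if the text has fewer than
    # n tokens the range is empty and so is the resulting Counter.
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


sample = "the long s and the long f"
unigrams = ngram_counts(sample, 1)  # individual words
bigrams = ngram_counts(sample, 2)   # pairs of adjacent words
```

With n = 1 the word "long" is simply counted twice, whereas with n = 2 the
sample distinguishes "long s" from "long f", which is the kind of
distinction that single-word counts cannot express.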


Sayan Bhattacharyya
CLIR Postdoctoral Research Fellow
HathiTrust Research Center
University of Illinois, Urbana-Champaign
sayan at illinois.edu
