[Humanist] 26.732 text-analysis in the news

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Thu Jan 31 08:21:54 CET 2013

                 Humanist Discussion Group, Vol. 26, No. 732.
            Department of Digital Humanities, King's College London
                Submit to: humanist at lists.digitalhumanities.org

        Date: Wed, 30 Jan 2013 23:17:36 -0200
        From: "dennis c.l." <cyberdennis at gmail.com>
        Subject: Re:  26.722 memory projects
        In-Reply-To: <20130128061717.01096F85 at digitalhumanities.org>

Dear Willard,
interesting article published in the New York Times
Prof. dennis c.l. - retired

Dickens, Austen and Twain, Through a Digital Lens

    Jan. 26, 2013

ANY list of the leading novelists of the 19th century, writing in English,
would almost surely include Charles Dickens, Thomas Hardy, Herman Melville,
Nathaniel Hawthorne and Mark Twain.

But they do not appear at the top of a list of the most influential writers
of their time. Instead, a recent study has found, Jane Austen, author of
“Pride and Prejudice,” and Sir Walter Scott, the creator of “Ivanhoe,” had
the greatest effect on other authors, in terms of writing style and themes.

These two were “the literary equivalent of Homo erectus, or, if you prefer,
Adam and Eve,” Matthew L. Jockers wrote in research published last year. He
based his conclusion on an analysis of 3,592 works published from 1780 to
1900. It was a lot of digging, and a computer did it.

The study, which involved statistical parsing and aggregation of thousands
of novels, made other striking observations. For example, Austen’s works
cluster tightly together in style and theme, while those of George Eliot
(a.k.a. Mary Ann Evans) range more broadly, and more closely resemble the
patterns of male writers. Using similar criteria, Harriet Beecher Stowe was
20 years ahead of her time, said Mr. Jockers, whose research will soon be
published in a book, “Macroanalysis: Digital Methods and Literary History”
(University of Illinois Press).

These findings are hardly the last word. At this stage, this kind of
digital analysis is mostly an intriguing sign that Big Data technology is
steadily pushing beyond the Internet industry and scientific research into
seemingly foreign fields like the social sciences and the humanities. The
new tools of discovery provide a fresh look at culture, much as the
microscope gave us a closer look at the subtleties of life and the
telescope opened the way to faraway galaxies.

“Traditionally, literary history was done by studying a relative handful of
texts,” says Mr. Jockers, an assistant professor of English and a
researcher at the Center for Digital Research in the Humanities at the
University of Nebraska. “What this technology does is let you see the big
picture — the context in which a writer worked — on a scale we’ve never
seen before.”

Mr. Jockers, 46, personifies the digital advance in the humanities. He
received a Ph.D. in English literature from Southern Illinois University,
but was also fascinated by computing and became a self-taught programmer.
Before he moved to the University of Nebraska last year, he spent more than
a decade at Stanford, where he was a founder of the Stanford Literary Lab,
which is dedicated to the digital exploration of books.

Today, Mr. Jockers describes the tools of his trade in terms familiar to an
Internet software engineer — algorithms that use machine learning and
network analysis techniques. His mathematical models are tailored to
identify word patterns and thematic elements in written text. The number
and strength of links among novels determine influence, much the way Google
ranks Web sites.
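
The link-based ranking described above can be illustrated with a PageRank-style power iteration. This is only a minimal sketch under stated assumptions: the article does not give Mr. Jockers's actual model, so the function name, the damping value and the similarity weights below are all hypothetical.

```python
def influence_scores(links, damping=0.85, iterations=50):
    """Rank nodes of a weighted directed graph with a PageRank-style iteration.

    `links` maps each work to {earlier work: similarity weight}; an edge means
    "this work resembles that earlier one". Names and weights are hypothetical.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)
    scores = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every node starts each round with the same small baseline score.
        new_scores = {node: (1.0 - damping) / n for node in nodes}
        for source in nodes:
            targets = links.get(source, {})
            total = sum(targets.values())
            if total == 0:
                # A work with no outgoing links spreads its score evenly.
                share = damping * scores[source] / n
                for node in nodes:
                    new_scores[node] += share
            else:
                # Otherwise its score flows along links in proportion to weight.
                for target, weight in targets.items():
                    new_scores[target] += damping * scores[source] * weight / total
        scores = new_scores
    return scores


# Made-up similarity links, for illustration only (not data from the study).
example_links = {
    "novel_c": {"novel_a": 0.9, "novel_b": 0.1},
    "novel_d": {"novel_a": 0.7, "novel_c": 0.3},
    "novel_b": {"novel_a": 1.0},
}
ranking = sorted(influence_scores(example_links).items(), key=lambda kv: -kv[1])
```

In this toy graph, "novel_a" comes out on top because every other work links to it strongly; more and stronger incoming links mean a higher score.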

It is this ability to collect, measure and analyze data for meaningful
insights that is the promise of Big Data technology. In the humanities and
social sciences, the flood of new data comes from many sources including
books scanned into digital form, Web sites, blog posts and social network
communications.

Data-centric specialties are growing fast, giving rise to a new vocabulary.
In political science, this quantitative analysis is called political
methodology. In history, there is cliometrics, which applies econometrics
to history. In literature, stylometry is the study of an author’s writing
style, and these days it leans heavily on computing and statistical
analysis. Culturomics is the umbrella term used to describe rigorous
quantitative inquiries in the social sciences and humanities.

“Some call it computer science and some call it statistics, but the essence
is that these algorithmic methods are increasingly part of every discipline
now,” says Gary King, director of the Institute for Quantitative Social
Science at Harvard.

Cultural data analysts often adapt biological analogies to describe their
work. Mr. Jockers, for example, called his research presentation “Computing
and Visualizing the 19th-Century Literary Genome.”

Such biological metaphors seem apt, because much of the research is a
quantitative examination of words. Just as genes are the fundamental
building blocks of biology, words are the raw material of ideas.

“What is critical and distinctive to human evolution is ideas, and how they
evolve,” says Jean-Baptiste Michel, a postdoctoral fellow at Harvard.

Mr. Michel and another researcher, Erez Lieberman Aiden, led a project to
mine the virtual book depository known as Google Books and to track the use
of words over time, compare related words and even graph them.

Google cooperated and built the software for making graphs open to the
public. The initial version of Google’s cultural exploration site began at
the end of 2010, based on more than five million books, dating from 1500.
By now, Google has scanned 20 million books, and the site is used 50 times
a minute. For example, type in “women” in comparison to “men,” and you see
that for centuries the number of references to men dwarfed those for women.
The crossover came in 1985, with women ahead ever since.

In work published in Science magazine in 2011, Mr. Michel and the research
team tapped the Google Books data to find how quickly the past fades from
books. For instance, references to “1880,” which peaked in that year, fell
to half by 1912, a lag of 32 years. By contrast, “1973” declined to half
its peak by 1983, only 10 years later. “We are forgetting our past faster
with each passing year,” the authors wrote.
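
The "fell to half" figures above are simple arithmetic on a frequency series. The sketch below shows one way such a lag could be computed; the helper name and the toy frequencies are hypothetical, and only the two year pairs come from the article.

```python
def years_to_half_peak(freq_by_year):
    """Return (peak_year, lag), where lag is the number of years after the
    peak until the frequency first drops to half the peak value or below.
    lag is None if the series never falls that far."""
    peak_year = max(freq_by_year, key=freq_by_year.get)
    half = freq_by_year[peak_year] / 2.0
    for year in sorted(freq_by_year):
        if year > peak_year and freq_by_year[year] <= half:
            return peak_year, year - peak_year
    return peak_year, None


# The two lags quoted in the article.
lag_1880 = 1912 - 1880
lag_1973 = 1983 - 1973

# Invented frequencies for illustration; not Google Books data.
toy_series = {1973: 10.0, 1975: 8.0, 1980: 6.0, 1983: 5.0, 1990: 3.0}
peak, lag = years_to_half_peak(toy_series)
```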

JON KLEINBERG, a computer scientist at Cornell, and a group of researchers
approached collective memory from a very different perspective.
Their work, published last year, focused on what makes spoken lines in
movies memorable. Sentences that endure in the public mind are evolutionary
success stories, Mr. Kleinberg says, comparing “the fitness of language and
the fitness of organisms.”

As a yardstick, the researchers used the “memorable quotes” selected from
the popular Internet Movie Database, or IMDb, and the number of times that
a particular movie line appears on the Web. Then they compared the
memorable lines to the complete scripts of the movies in which they
appeared — about 1,000 movies.

To train their statistical algorithms on common sentence structure, word
order and most widely used words, they fed their computers a huge archive
of articles from news wires. The memorable lines consisted of surprising
words embedded in sentences of ordinary structure. “We can think of
memorable quotes as consisting of unusual word choices built on a
scaffolding of common part-of-speech patterns,” their study said.

Consider the line “You had me at hello,” from the movie “Jerry Maguire.” It
is, Mr. Kleinberg notes, basically the same sequence of parts of speech as
the quotidian “I met him in Boston.” Or consider this line from “Apocalypse
Now”: “I love the smell of napalm in the morning.” Only one word separates
that utterance from this: “I love the smell of coffee in the morning.”
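
The idea of unusual words on an ordinary scaffold can be sketched by scoring each sentence against a background word-frequency model. This is an illustration only, not the researchers' method: the one-sentence background text stands in for the news-wire archive, and the function name and smoothing choice are assumptions.

```python
import math
from collections import Counter


def mean_surprisal(sentence, background_counts):
    """Average negative log2 probability per word under a unigram model
    with add-one smoothing; higher means more unusual word choices."""
    words = sentence.lower().split()
    total = sum(background_counts.values())
    vocab = len(background_counts) + 1  # one extra slot for unseen words
    bits = 0.0
    for word in words:
        p = (background_counts.get(word, 0) + 1) / (total + vocab)
        bits += -math.log2(p)
    return bits / len(words)


# Toy background text standing in for a large news archive (hypothetical).
background = Counter(
    "i love the smell of coffee in the morning and the taste of coffee at night".split()
)

movie_line = "I love the smell of napalm in the morning"
plain_line = "I love the smell of coffee in the morning"

# Word pairs that differ at the same position in the two sentences.
differences = [
    (a, b) for a, b in zip(movie_line.split(), plain_line.split()) if a != b
]

movie_score = mean_surprisal(movie_line, background)
plain_score = mean_surprisal(plain_line, background)
```

Here the two sentences share every word but one, so the only source of extra surprisal in the movie line is the word the background text has never seen.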

This kind of analysis can be used for all kinds of communications,
including advertising. Indeed, Mr. Kleinberg’s group also looked at ad
slogans. Statistically, the ones most similar to memorable movie quotes
included “Quality never goes out of style,” for Levi’s jeans, and “Come to
Marlboro Country,” for Marlboro cigarettes.

But the algorithmic methods aren’t a foolproof guide to real-world success.
One ad slogan that didn’t fit well within the statistical parameters for
memorable lines was the Energizer batteries catchphrase, “It keeps going
and going and going.”

Quantitative tools in the humanities and the social sciences, as in other
fields, are most powerful when they are controlled by an intelligent human.
Experts with deep knowledge of a subject are needed to ask the right
questions and to recognize the shortcomings of statistical models.

“You’ll always need both,” says Mr. Jockers, the literary quant. “But we’re
at a moment now when there is much greater acceptance of these methods than
in the past. There will come a time when this kind of analysis is just part
of the tool kit in the humanities, as in every other discipline.”
