[Humanist] 27.495 big data is bunk

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Fri Nov 1 07:05:15 CET 2013

                 Humanist Discussion Group, Vol. 27, No. 495.
            Department of Digital Humanities, King's College London
                Submit to: humanist at lists.digitalhumanities.org

        Date: Thu, 31 Oct 2013 02:57:25 -0500
        From: Anupam Basu <philhellene at yahoo.com>
        Subject: Re:  27.490 big data is bunk?
        In-Reply-To: <20131031065145.275CC76A7 at digitalhumanities.org>

    "The exciting thing is you can get a lot of this stuff done just in
    Excel," he said. "You don't need these big platforms. You don't need
    all this big fancy stuff. If anyone says 'big' in front of it, you
    should look at them very skeptically ... You can tell charlatans
    when they say 'big' in front of everything."

While there is, of course, some debate on how to define the phrase "big 
data," it is thrown about rather recklessly in the media. The prevalent 
opinion focuses on computing resources as a measure of "bigness" - i.e., 
your data isn't big unless you're talking about several terabytes or 
even petabytes that need distributed clusters, MapReduce, etc. to 
process (see http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html). I 
know of no humanities projects that need to operate at this scale. I 
don't know about Excel, but most of what we do is certainly tractable 
with fairly moderate computing power. But if we are to stick to hardware 
resources as a frame of reference, I'd define an intermediate notion of 
scale as the point where your data is large enough that you 
can't simply brute-force your way through it - where you have to think 
about data structures, memory usage and optimizing algorithms. With a 
corpus of several thousand texts, it's easy to hit this threshold on a 
single computer: you start to run out of RAM and have to serialize 
things, or brute-force searches simply don't scale and you have to 
look at new tools and algorithms. Lying between the domains of Excel and 
Hadoop, this is where a lot of humanities "big data" analytics happens.
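
To make the "brute force doesn't scale" point concrete, here is a minimal
sketch in Python. It is an illustration only, not a description of any
particular project's tooling: the three-document corpus and the function
names are hypothetical, and tokenization is naive whitespace splitting.
A brute-force search rescans every text on every query, while an inverted
index pays the scanning cost once and answers later queries with a
dictionary lookup.

```python
from collections import defaultdict

def brute_force_search(texts, word):
    """Rescan every token of every text for each query."""
    return sorted(doc_id for doc_id, text in texts.items()
                  if word in text.lower().split())

def build_inverted_index(texts):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in texts.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def indexed_search(index, word):
    """Answer a query with one lookup instead of a full scan."""
    return sorted(index.get(word, set()))

# Hypothetical toy corpus; a real one would hold thousands of texts.
texts = {
    "a": "the quick brown fox",
    "b": "the lazy dog",
    "c": "a quick reply",
}
index = build_inverted_index(texts)
print(indexed_search(index, "quick"))
```

The trade-off is the one described above: the index has to fit in memory
(or be serialized to disk), so you exchange scanning time for space and
some thought about data structures.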

On the other hand, one might argue that all data is big data in the 
humanities. That is, the moment we enter the realm of "data" in the 
humanities - the moment we scale up from the conventional logic and 
practices of reading and start to think in terms of corpora and 
corpus-wide analysis - we enter a domain that might not stretch 
computing hardware or even Excel, but that requires us to rethink and 
fundamentally reevaluate paradigmatic assumptions about reading and 
analysis. Conventional big data that uses distributed processing 
requires a radical rethink of how computational resources are used and 
how large-scale analysis is broken down into tractable chunks. I'd say 
that the jump from 'close' to 'distant' reading and the translation of 
qualitative experience into quantitative data is no less radical. If the 
phrase were not so co-opted by commercial gimmicks, I'd be happy to 
settle on this paradoxical notion of computationally small data that is 
"big" from a humanistic perspective.


Washington University in Saint Louis
Interdisciplinary Project in the Humanities
Humanities Digital Workshop
