[Humanist] 27.327 pubs: memory (co-evolutionary); big data

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Wed Sep 11 07:58:39 CEST 2013

                 Humanist Discussion Group, Vol. 27, No. 327.
            Department of Digital Humanities, King's College London
                Submit to: humanist at lists.digitalhumanities.org

        Date: Wed, 11 Sep 2013 06:48:24 +0100
        From: Willard McCarty <willard.mccarty at mccarty.org.uk>
        Subject: books: memory; big data

Two books to draw to your attention.

(1) Belinda Barnet, Memory Machines: The Evolution of Hypertext (London: 
Anthem, 2013).

I've only begun to read this book, but already from Chapter 1, 
"Technical Evolution", it's clear that the book is worth several 
candles. As some here will know, the term "evolution" has been used 
rather sloppily to describe how machines develop over time, 
"co-evolution" how we and our machines do it together. Barnet's 
meditations on this term, with the help of palaeobiologist Niles 
Eldredge (who worked up the idea of "punctuated equilibrium" with 
Stephen Jay Gould as an alternative to gradualism in evolution), are a 
major contribution to our thinking about what's happening to us now, and 
what's always been happening since we started inventing tools.

(2) Frontiers in Massive Data Analysis (Washington DC: National 
Academies Press, 2013), downloadable from 

Also just encountered. It would appear that, unsurprisingly, the 
humanities are not considered, but here, also unsurprisingly, we can 
learn from the sciences. The following, for example, caught my eye:

> It is natural to be optimistic about the prospects.... However, such
> optimism must be tempered by an understanding of the major
> difficulties that arise in attempting to achieve the envisioned
> goals. In part, these difficulties are those familiar from
> implementations of large-scale databases—finding and mitigating
> bottlenecks.... But the challenges for massive data go beyond the
> storage, indexing, and querying that have been the province of
> classical database systems (and classical search engines) and,
> instead, hinge on the ambitious goal of inference. Inference is the
> problem of turning data into knowledge, where knowledge often is
> expressed in terms of entities that are not present in the data per
> se but are present in models that one uses to interpret the data.
> Statistical rigor is necessary to justify the inferential leap from
> data to knowledge, and many difficulties arise in attempting to bring
> statistical principles to bear on massive data. Overlooking this
> foundation may yield results that are not useful at best, or harmful
> at worst. In any discussion of massive data and inference, it is
> essential to be aware that it is quite possible to turn data into
> something resembling knowledge when actually it is not. Moreover, it
> can be quite difficult to know that this has happened.
> Indeed, many issues impinge on the quality of inference. A major one
> is that of “sampling bias.” .... Another major issue is “provenance.”
> Many systems involve layers of inference, where “data” are not the
> original observations but are the products of an inferential
> procedure of some kind.... Finally, there is the major issue of
> controlling error rates when many hypotheses are being considered.
> Indeed, massive data sets generally involve growth not merely in the
> number of individuals represented (the “rows” of the database) but
> also in the number of descriptors of those individuals (the “columns”
> of the database). Moreover, we are often interested in the predictive
> ability associated with combinations of the descriptors; this can
> lead to exponential growth in the number of hypotheses considered,
> with severe consequences for error rates. That is, a naive appeal to
> a “law of large numbers” for massive data is unlikely to be
> justified; if anything, the perils associated with statistical
> fluctuations may actually increase as data sets grow in size. While
> the field of statistics has developed tools that can address such
> issues in principle, in the context of massive data care must be
> taken with all such tools for two main reasons: (1) all statistical
> tools are based on assumptions about characteristics of the data set
> and the way it was sampled, and those assumptions may be violated in
> the process of assembling massive data sets; and (2) tools for
> assessing errors of procedures, and for diagnostics, are themselves
> computational procedures that may be computationally infeasible as
> data sets move into the massive scale.
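The report's warning about error rates when many hypotheses are considered can be made concrete with a small sketch (mine, not the report's): assuming independent tests, each run at a conventional significance level of alpha = 0.05, the probability of at least one false "discovery" climbs rapidly with the number of hypotheses, and a simple correction such as Bonferroni's must shrink the per-test level accordingly.

```python
# Sketch (not from the report): the family-wise false-positive rate
# under many independent hypothesis tests, each at level alpha = 0.05.

def family_wise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of num_tests independent true
    null hypotheses is falsely rejected at level alpha."""
    return 1 - (1 - alpha) ** num_tests

def bonferroni_level(num_tests: int, alpha: float = 0.05) -> float:
    """Per-test level needed to hold the family-wise rate near alpha."""
    return alpha / num_tests

if __name__ == "__main__":
    for m in (1, 10, 100, 1000):
        print(f"{m:>5} tests: P(>=1 false positive) = "
              f"{family_wise_error_rate(m):.3f}, "
              f"Bonferroni per-test level = {bonferroni_level(m):.2e}")
```

With 100 hypotheses the chance of at least one spurious result already exceeds 99 percent, which is the report's point: data can be turned into "something resembling knowledge when actually it is not."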

Comments are most welcome.

Willard McCarty (www.mccarty.org.uk/), Professor, Department of Digital
Humanities, King's College London, and Research Group in Digital
Humanities, University of Western Sydney
