[Humanist] 29.17 billions of pages' worth

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Tue May 12 07:29:19 CEST 2015


                  Humanist Discussion Group, Vol. 29, No. 17.
            Department of Digital Humanities, King's College London
                       www.digitalhumanities.org/humanist
                Submit to: humanist at lists.digitalhumanities.org

  [1]   From:    Fabio Ciotti <fabio.ciotti at uniroma2.it>                   (20)
        Subject: Re:  29.14 billions of pages' worth: sign of the times?

  [2]   From:    "Downie, J Stephen" <jdownie at illinois.edu>                (12)
        Subject: RE:  29.14 billions of pages' worth: sign of the times?


--[1]------------------------------------------------------------------------
        Date: Mon, 11 May 2015 10:16:41 +0200
        From: Fabio Ciotti <fabio.ciotti at uniroma2.it>
        Subject: Re:  29.14 billions of pages' worth: sign of the times?
        In-Reply-To: <20150510061135.A040A669C at digitalhumanities.org>


>
> Some of my colleagues may be interested to note that all of this vast
> collection of metadata is in JSON, not XML format.
> Due to access restrictions I couldn't apparently download any actual text,
> but the format of that seems to be plain text or PDF.
> Time to move on from XML?
>

Maybe, maybe not. Personally I cannot see where JSON is better than XML,
nor where it really differs from XML as far as the competence of the
average user is concerned.

That said, and I hope not to reignite this rather boring war of religion
that has been going on since 1986 (the date of SGML standardization, just
to fix a conventional starting point...), I wonder whether all of these
(meta)data are really of any interest to a literary scholar. Is this big
data deluge, which we can play with using purely quantitative methods,
giving us any insight into texts? Hype aside, I would really like to know
whether anyone in the community of digital literary scholars is seriously
thinking about the adequacy of these methods. Of course I do not want to
start another war of religion, just a good old controversy based on
argumentation.

Fabio



--[2]------------------------------------------------------------------------
        Date: Tue, 12 May 2015 00:39:05 +0000
        From: "Downie, J Stephen" <jdownie at illinois.edu>
        Subject: RE:  29.14 billions of pages' worth: sign of the times?
        In-Reply-To: <20150510061135.A040A669C at digitalhumanities.org>


Dear Dr. Schmidt and Colleagues:

Thanks for your feedback, Desmond, on our recent data release. We are quite excited to be able to release this data for use by scholars everywhere. The recent release represents early days for what we hope will be an ongoing aspect of our work at the HathiTrust Research Center (HTRC). We are learning by doing. We welcome each and every comment, suggestion and question so we can make subsequent releases as useful as possible.

The tech team at HTRC chose JSON for this release for its relative simplicity and its relative ease of use in processing and parsing the data, since the format is basically name-value pairs, variously nested. Also, as with many projects, JSON was a format with which members of the tech team felt quite comfortable, having used it before in other tasks. I mention this to let folks know that we are actually rather agnostic as to possible formats for future releases. We are open to all suggestions and ideas.
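[A small illustration of the "nested name-value pairs" point above: any standard JSON library can walk such a record with ordinary key and index access. The field names below are invented for illustration only and are not the actual HTRC extracted-features schema.]

```python
import json

# A toy record with the kind of nested name-value structure described
# above. Field names here are hypothetical, not the real HTRC schema.
record = """
{
  "volumeId": "example.0001",
  "metadata": {"title": "An Example Volume", "pubDate": "1890"},
  "pages": [
    {"seq": 1, "tokenCount": 310},
    {"seq": 2, "tokenCount": 287}
  ]
}
"""

volume = json.loads(record)

# Nested values are reached by plain dictionary/list access.
title = volume["metadata"]["title"]
total_tokens = sum(page["tokenCount"] for page in volume["pages"])

print(title, total_tokens)  # → An Example Volume 597
```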

Notwithstanding that the underlying volumes from which we derived our extracted-features metadata are in the "public domain", agreements with the parties that did the scanning, along with variations in international copyright laws with regard to public-domain status determinations, preclude the HTRC from actually delivering the underlying text and page images to the community. Individual works most likely can be viewed, however, at the HathiTrust Digital Library (http://hathitrust.org/), using the Volume ID as the key to finding the specific volume in question.

The HathiTrust does have a mechanism for scholars to request public domain datasets for specific research projects. If interested, I recommend that you visit http://www.hathitrust.org/datasets for more information.

As time progresses, it is our goal to evolve the types of extracted features we share. At the same time, we plan to develop tools to make selecting and downloading subsets of features and volumes easier. Finally, we also hope to begin releasing features from copyright-restricted works as part of the HTRC's "non-consumptive research" framework. This way we can assist the community in making analytic and scholarly use of the remaining ~9 million volumes/~3 billion pages in the HT digital library!

I hope this has helped to clarify things a bit. If not, please drop me or the HTRC a line and we will try to make things clearer. 

Cheers and thanks,
Stephen



