[Humanist] 29.14 billions of pages' worth: sign of the times?

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Sun May 10 08:11:35 CEST 2015


                  Humanist Discussion Group, Vol. 29, No. 14.
            Department of Digital Humanities, King's College London
                       www.digitalhumanities.org/humanist
                Submit to: humanist at lists.digitalhumanities.org



        Date: Sun, 10 May 2015 06:51:18 +1000
        From: Desmond Schmidt <desmond.allan.schmidt at gmail.com>
        Subject: Re:  29.11 billions of pages' worth
        In-Reply-To: <20150509062702.D6D2F5FA8 at digitalhumanities.org>


Some of my colleagues may be interested to note that all of this vast
collection of metadata is in JSON, not XML format.
Because of access restrictions I apparently couldn't download any actual
text, but it seems to be in plain text or PDF format.
Time to move on from XML?
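
For anyone curious what working with JSON of this kind looks like, here is a
minimal Python sketch that sums page-level token counts into a single
volume-level word count. The field names (`features`, `pages`, `body`,
`tokenPosCount`) and the sample record are assumptions made up for
illustration; I have not checked them against the dataset's documentation.

```python
import json
from collections import Counter

# Hypothetical sample record. The layout below is an assumption for
# illustration only, not the documented schema of the dataset.
SAMPLE = """
{
  "id": "example.volume",
  "metadata": {"title": "An Example Volume"},
  "features": {
    "pages": [
      {"seq": "00000001",
       "body": {"tokenPosCount": {"the": {"DT": 2}, "whale": {"NN": 1}}}},
      {"seq": "00000002",
       "body": {"tokenPosCount": {"the": {"DT": 1}, "sea": {"NN": 3}}}}
    ]
  }
}
"""


def volume_token_counts(record):
    """Sum page-level, POS-tagged token counts into one bag of words."""
    totals = Counter()
    for page in record["features"]["pages"]:
        for token, pos_counts in page["body"]["tokenPosCount"].items():
            # Collapse the part-of-speech breakdown into one count per token.
            totals[token.lower()] += sum(pos_counts.values())
    return totals


record = json.loads(SAMPLE)
counts = volume_token_counts(record)
print(counts.most_common(3))
```

Counts aggregated this way could then be fed to whatever topic-modelling or
classification tool one prefers.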

Desmond Schmidt
University of Queensland

On Sat, May 9, 2015 at 4:27 PM, Humanist Discussion Group <
willard.mccarty at mccarty.org.uk> wrote:

>                   Humanist Discussion Group, Vol. 29, No. 11.
>             Department of Digital Humanities, King's College London
>                        www.digitalhumanities.org/humanist
>                 Submit to: humanist at lists.digitalhumanities.org
>
>
>
>         Date: Fri, 8 May 2015 16:05:47 +0000
>         From: "Downie, J Stephen" <jdownie at illinois.edu>
>         Subject: Extracted Features Dataset Now Available for 4.8 Million
> Volumes/1.8 Billion Pages
>         In-Reply-To: <D1724488.730F%rdubnic2 at illinois.edu>
>
> Dear Colleagues:
>
> The HathiTrust Research Center is pleased to announce the release of its
> Extracted Features Dataset (v. 0.2), a dataset derived from 4.8 million
> public domain volumes totaling 1.8 billion pages currently available in the
> HathiTrust Digital Library collection. The dataset includes over 734
> billion words, dozens of languages, and spans multiple centuries. Features
> are informative, quantified characteristics of a text, and include:
>
> *       Volume-level metadata
>
> *       Page-level features
>
>         *       Part-of-speech-tagged token counts
>
>         *       Header and footer identification
>
>         *       Sentence and line count
>
>         *       Algorithmic language detection
>
> *       Line-level features
>
>         *       Beginning and end line character count
>
>         *       Maximum length of the sequence of capital characters
> starting a line
>
> These features allow for analysis of large worksets of volumes in the
> HathiTrust public domain collection, at scales previously intractable for
> most individual researchers. For example, page-level token (word) counts
> can be used to help build topic models and classifications, and to perform
> other text analytics. Similarly, features can be used to evaluate readability of
> a given volume or workset.
>
> How to get the data:
>
> The entire dataset, as well as sample subsets and custom worksets, is
> available at: https://sharc.hathitrust.org/features
>
> How to cite:
>
> Boris Capitanu, Ted Underwood, Peter Organisciak, Sayan Bhattacharyya,
> Loretta Auvil, Colleen Fallaw, J. Stephen Downie (2015). Extracted Feature
> Dataset from 4.8 Million HathiTrust Digital Library Public Domain Volumes
> (v0.2). [Dataset]. HathiTrust Research Center, doi:10.13012/j8td9v7m.
>
> This feature dataset is provided under a Creative Commons Attribution 4.0
> International License.
>
> About the HathiTrust Research Center:
>
> The HTRC is a collaborative research center launched jointly by Indiana
> University and the University of Illinois, along with the HathiTrust
> Digital Library, to help meet the technical challenges of dealing with
> massive amounts of digital text that researchers face by developing
> cutting-edge software tools and cyberinfrastructure to enable advanced
> computational access to the growing digital record of human knowledge.
>
> For more information about the HathiTrust Research Center, visit
> http://www.hathitrust.org/htrc
>
> **********************************************************
>    "Research funding makes the world a better place"
> **********************************************************
> J. Stephen Downie, PhD
> Associate Dean for Research
> Professor
> Graduate School of Library and Information Science
> University of Illinois at Urbana-Champaign
> [Vox/Voicemail] (217) 649-3839





