[Humanist] 31.470 extracting text from websites

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Sat Dec 16 09:27:13 CET 2017


                 Humanist Discussion Group, Vol. 31, No. 470.
            Department of Digital Humanities, King's College London
                       www.digitalhumanities.org/humanist
                Submit to: humanist at lists.digitalhumanities.org

  [1]   From:    Adam Crymble <adam.crymble at gmail.com>                     (81)
        Subject: Re:  31.466 extracting text from websites for analysis?

  [2]   From:    Susan Brown <sbrown at uoguelph.ca>                          (72)
        Subject: Re:  31.466 extracting text from websites for analysis?

  [3]   From:    "Huskey, Samuel J." <huskey at ou.edu>                       (65)
        Subject: Re: 31.466 extracting text from websites for analysis?


--[1]------------------------------------------------------------------------
        Date: Fri, 15 Dec 2017 09:26:25 +0000
        From: Adam Crymble <adam.crymble at gmail.com>
        Subject: Re:  31.466 extracting text from websites for analysis?
        In-Reply-To: <20171215062703.CD104872A at s16382816.onlinehome-server.info>


Dear Alex,

The Programming Historian (http://programminghistorian.org) has a number of
tutorials aimed at people looking to learn how to extract text from
websites (or similar files).

I would suggest you start with the Beautiful Soup library for Python, which
is designed explicitly for your task. Jeri Wieringa has written a tutorial
on this: https://programminghistorian.org/lessons/intro-to-beautiful-soup
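
To give a flavour of the approach, here is a minimal sketch (the URL is
a placeholder; it assumes the requests and beautifulsoup4 packages are
installed):

    # Minimal sketch: fetch a page and extract its visible text.
    # Assumes: pip install requests beautifulsoup4
    import requests
    from bs4 import BeautifulSoup

    url = "https://example.org/page.html"  # placeholder URL
    response = requests.get(url)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")

    # Drop boilerplate elements before extracting the text.
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()

    text = soup.get_text(separator=" ", strip=True)
    print(text)

Stripping the nav, header, and footer elements up front also helps with
the extraneous-content problem you describe.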

There are other tutorials as well, which you might find useful. But that
should be a good starting point.

Sincerely,

Adam Crymble
Editor, Programming Historian
Senior Lecturer of Digital History
University of Hertfordshire
a.crymble at herts.ac.uk

On Fri, Dec 15, 2017 at 6:27 AM, Humanist Discussion Group <
willard.mccarty at mccarty.org.uk> wrote:

>                  Humanist Discussion Group, Vol. 31, No. 466.
>             Department of Digital Humanities, King's College London
>                        www.digitalhumanities.org/humanist
>                 Submit to: humanist at lists.digitalhumanities.org
>
>
>
>         Date: Wed, 13 Dec 2017 16:25:54 -0500
>         From: alex at ethnographer.ca
>         Subject: Websites' Text Content, Practical Concerns for Analysis
>
>
> Greetings, fellow humanists (and Humanists)!
>
> Does anyone have insight on extracting the core textual content from websites
> and feeding that into, say, Voyant Tools?
>
> Some DH tools and methods, including topic modelling, have made their
> way into my day job with engineers, entrepreneurs, and other
> technologists. Wishing to provide further DH insight, I was tasked with
> applying similar strategies to semi-automatic analysis of organizational
> websites. Eventually, we'd like to identify some tokens which could help
> reveal 'cultural dimensions'. Sounds like a tall order, but the process
> can lead us in interesting directions. (The challenge in defining
> cultural dimensions is a completely separate story, though assumptions
> really affect data collection in a case like this.)
>
> For a number of reasons (Carleton's Shawn Graham is partly responsible),
> feeding things into Voyant Tools sounds like the most appropriate
> approach at this point. However, what VT expects as documents needs to
> be prepared carefully.
>
> One of my attempts was to produce sitemaps in RSS format and give those
> to VT to chew on. That made VT choke, maybe because some of the links
> were to feeds themselves. I tried adding URLs in smaller batches, and that
> did produce interesting results, but there are still issues in getting
> VT to create a corpus from links without massaging. Maybe that part is
> specific to Voyant Tools and the community there could help.
>
> The broader issue is that the webpages contain a lot of extraneous
> content, including navigation, headers, and footers. There are APIs and
> tools out there to parse webpages and extract the main content
> (Boilerpipe, Goose Extractor, Mercury Web Parser...). I haven't been that
> successful with them yet. People have suggested Portia for a code-free
> approach, but that actually sounds quite involved. Others advise me to
> learn to do it in Python and/or through XPath (VT support would probably
> help, then). But it's a bit hard to locate the first step in
> self-training for this (my minimal Python skills mostly have to do with
> Raspberry Pi and musicking).
>
> We might actually skip this whole microproject if it requires too much
> effort. Just thought it could be a neat opportunity to bring DH (and
> humanism) to our work.
>
> So any insight would be greatly appreciated.
>
> Thanks!
>
> --
> Alex Enkerli
> Learning Pathways Strategist, Global Cybersecurity Resource, Carleton
> University
> Part-Time Professor, School of Sociological and Anthropological Studies,
> University of Ottawa


--[2]------------------------------------------------------------------------
        Date: Fri, 15 Dec 2017 12:28:43 +0000
        From: Susan Brown <sbrown at uoguelph.ca>
        Subject: Re:  31.466 extracting text from websites for analysis?
        In-Reply-To: <20171215062703.CD104872A at s16382816.onlinehome-server.info>


Dear Alex,

It sounds like a really interesting project. In addition to looking at Voyant, you might also want to take a look at the technical approaches that Ian Milligan and his collaborators have been taking with web archives, particularly his work that involves looking at the websites of political organizations. Their current project is here: https://uwaterloo.ca/web-archive-group/

All the best,
Susan


_____________________
Susan Brown
Canada Research Chair in Collaborative Digital Scholarship
Director, Orlando Project; Project Leader, Canadian Writing Research Collaboratory
President, Canadian Society for Digital Humanities/Société canadienne des humanités numériques

Professor
School of English and Theatre Studies
University of Guelph
Guelph, Ontario N1G 2W1 Canada
519-824-4120 x53266 (office)
sbrown at uoguelph.ca

Visiting Professor
English and Film Studies
University of Alberta
Edmonton, Alberta T6G 2E5
780-492-7803
susan.brown at ualberta.ca

http://orlando.cambridge.org
http://www.ualberta.ca/ORLANDO
http://www.cwrc.ca


--[3]------------------------------------------------------------------------
        Date: Fri, 15 Dec 2017 16:08:28 +0000
        From: "Huskey, Samuel J." <huskey at ou.edu>
        Subject: Re: 31.466 extracting text from websites for analysis?
        In-Reply-To: <mailman.13.1513335607.17182.humanist at lists.digitalhumanities.org>


I have had some success extracting texts using the Beautiful Soup module in Python. It will allow you to extract just the text without any HTML elements or extraneous code. Even with minimal Python skills, you can get a lot out of Beautiful Soup. There are some excellent tutorials available. I found the following especially helpful:


  *   https://www.digitalocean.com/community/tutorials/how-to-scrape-web-pages-with-beautiful-soup-and-python-3
  *   https://programminghistorian.org/lessons/intro-to-beautiful-soup

I’ve used Beautiful Soup to extract text to use in an ElasticSearch application, so you should be able to use it for the visualization apps you mentioned.
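
For instance, here is a rough sketch of converting a folder of saved
pages into plain-text files (the pages/ and corpus/ directory names are
placeholders):

    # Sketch: strip the HTML from downloaded pages and write plain-text
    # files that Voyant Tools or ElasticSearch can ingest.
    from pathlib import Path
    from bs4 import BeautifulSoup

    Path("corpus").mkdir(exist_ok=True)  # output directory

    for html_file in Path("pages").glob("*.html"):
        soup = BeautifulSoup(html_file.read_text(encoding="utf-8"),
                             "html.parser")
        # get_text() drops all markup, leaving only the visible text.
        text = soup.get_text(separator="\n", strip=True)
        out = Path("corpus", html_file.stem + ".txt")
        out.write_text(text, encoding="utf-8")

The resulting .txt files can then be uploaded to Voyant as a corpus.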

—Sam

Samuel J. Huskey
Associate Professor and Chair
Department of Classics and Letters
University of Oklahoma
Norman, OK 73019-4042

(405) 325-0490

