[Humanist] 31.466 extracting text from websites for analysis?

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Fri Dec 15 07:27:03 CET 2017

                 Humanist Discussion Group, Vol. 31, No. 466.
            Department of Digital Humanities, King's College London
                Submit to: humanist at lists.digitalhumanities.org

        Date: Wed, 13 Dec 2017 16:25:54 -0500
        From: alex at ethnographer.ca
        Subject: Websites' Text Content, Practical Concerns for Analysis

Greetings, fellow humanists (and Humanists)!

Does anyone have insight on extracting the core textual content from
websites and feeding that into, say, Voyant Tools?

Some DH tools and methods, including topic modelling, have made their
way into my day job with engineers, entrepreneurs, and other
technologists. Wishing to provide further DH insight, I was tasked with
applying similar strategies to semi-automatic analysis of organizational 
websites. Eventually, we'd like to identify some tokens which could help 
reveal 'cultural dimensions'. Sounds like a tall order, but the process 
can lead us in interesting directions. (The challenge in defining 
cultural dimensions is a completely separate story, though assumptions 
really affect data collection in a case like this.)

For a number of reasons (Carleton's Shawn Graham is partly responsible), 
feeding things into Voyant Tools sounds like the most appropriate 
approach, at this point. However, the documents VT expects as input
need to be prepared carefully.

One of my attempts was to produce sitemaps in RSS format and give those
to VT to chew on. That made VT choke, maybe because some of the links
were to feeds themselves. Tried adding URLs in smaller batches and that 
did produce interesting results, but there are still issues in getting 
VT to create a corpus from links without massaging. Maybe that part is 
specific to Voyant Tools and the community there could help.
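
To illustrate the batching idea, here is a minimal Python sketch using
only the standard library. It assumes the sitemap is a plain RSS 2.0
document with one <link> per <item>; the names page_urls_from_rss,
batches, and FEED_HINTS are hypothetical, and the suffixes used to guess
which links are feeds are an assumption that would need tuning per site.
It only prepares lists of URLs and does not talk to Voyant Tools itself.

```python
import xml.etree.ElementTree as ET

# Assumed heuristic: URLs ending in one of these are treated as feeds
# rather than ordinary pages. Adjust for the site being analysed.
FEED_HINTS = ("/feed", ".rss", ".xml", ".atom")


def page_urls_from_rss(rss_text):
    """Return the item links of an RSS 2.0 document, skipping likely feeds."""
    root = ET.fromstring(rss_text)
    urls = []
    for item in root.iter("item"):
        url = (item.findtext("link") or "").strip()
        if not url:
            continue
        # Ignore a trailing slash and letter case when checking the suffix.
        if url.lower().rstrip("/").endswith(FEED_HINTS):
            continue
        urls.append(url)
    return urls


def batches(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]
```

Each batch could then be pasted into the tool one at a time, which makes
it easier to spot which link, if any, causes trouble.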

The broader issue is that the webpages contain a lot of extraneous 
content, including navigation, headers, and footers. There are APIs and 
tools out there to parse webpages and extract the main content 
(Boilerpipe, Goose Extractor, Mercury Web Parser...). Haven't been that 
successful with them, yet. People have suggested Portia for a code-free 
approach, but that actually sounds quite involved. Others advise me to 
learn to do it in Python and/or through XPath (VT support would probably 
help, then). But it's a bit hard to locate the first step in 
self-training for this (my minimal Python skills mostly have to do with 
Raspberry Pi and musicking).
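
As one possible first step in Python, the sketch below strips
navigation, headers, and footers using only the standard library's
html.parser module. It is a rough stand-in for what the dedicated
extractors do, not a replacement for them: it assumes the site marks
those regions with the HTML5 elements listed in SKIP_TAGS, and the names
MainTextExtractor and extract_main_text are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: extraneous content lives inside these elements.
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style",
             "noscript", "form"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside any element in SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # how many skipped elements we are inside
        self.chunks = []      # text runs kept so far

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            text = " ".join(data.split())  # collapse runs of whitespace
            if text:
                self.chunks.append(text)


def extract_main_text(html):
    """Return the kept text runs of an HTML string, one per line."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

The returned plain text could be saved to a file per page and uploaded
as a document. Sites that build their navigation from generic <div>
elements would need a different rule, for instance one based on class
names or XPath.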

We might actually skip this whole microproject, if it requires too much 
effort. Just thought it could be a neat opportunity to bring DH (and 
humanism) to our work.

So any insight would be greatly appreciated.


Alex Enkerli
Learning Pathways Strategist, Global Cybersecurity Resource, Carleton 
Part-Time Professor, School of Sociological and Anthropological Studies, 
University of Ottawa
