[Humanist] 31.466 extracting text from websites for analysis?
Humanist Discussion Group
willard.mccarty at mccarty.org.uk
Fri Dec 15 07:27:03 CET 2017
Humanist Discussion Group, Vol. 31, No. 466.
Department of Digital Humanities, King's College London
Submit to: humanist at lists.digitalhumanities.org
Date: Wed, 13 Dec 2017 16:25:54 -0500
From: alex at ethnographer.ca
Subject: Websites' Text Content, Practical Concerns for Analysis
Greetings, fellow humanists (and Humanists)!
Does anyone have insight on extracting the core textual content from
websites and feeding that into, say, Voyant Tools?
Some DH tools and methods, including topic modelling, have made their
way into my day job with engineers, entrepreneurs, and other
technologists. Wishing to provide further DH insight, I was tasked with
applying similar strategies to the semi-automatic analysis of
organizational websites. Eventually, we'd like to identify some tokens that could help
reveal 'cultural dimensions'. Sounds like a tall order, but the process
can lead us in interesting directions. (The challenge in defining
cultural dimensions is a completely separate story, though assumptions
really affect data collection in a case like this.)
For a number of reasons (Carleton's Shawn Graham is partly responsible),
feeding things into Voyant Tools sounds like the most appropriate
approach at this point. However, what VT expects as documents
needs to be prepared carefully.
One of my attempts was to produce sitemaps in RSS format and give those
to VT to chew on. That made VT choke, maybe because some of the links
were to feeds themselves. I tried adding URLs in smaller batches, and that
did produce interesting results, but there are still issues in getting
VT to create a corpus from links without some massaging. Maybe that part is
specific to Voyant Tools and the community there could help.
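
For concreteness, here is roughly the kind of Python sketch I have in
mind for that step, under the assumption that the sitemap is a standard
sitemap.org XML file rather than the RSS variant I generated (the URL
and batch size are placeholders):

import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://example.org/sitemap.xml"  # placeholder
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

with urllib.request.urlopen(SITEMAP_URL) as response:
    tree = ET.parse(response)

# Keep page URLs, skipping links that point to feeds, since those
# seemed to be what made VT choke.
urls = [loc.text for loc in tree.findall(".//sm:loc", NS)
        if loc.text and not loc.text.endswith((".rss", ".atom"))]

# Print the URLs in small batches, one per line, to paste into VT.
batch_size = 20  # placeholder
for start in range(0, len(urls), batch_size):
    print("\n".join(urls[start:start + batch_size]))
    print()

Even a list like that may still need hand-pruning before VT accepts it,
but it at least keeps the feeds out of the batches.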
The broader issue is that the webpages contain a lot of extraneous
content, including navigation, headers, and footers. There are APIs and
tools out there to parse webpages and extract the main content
(Boilerpipe, Goose Extractor, Mercury Web Parser...). I haven't been that
successful with them yet. People have suggested Portia for a code-free
approach, but that actually sounds quite involved. Others advise me to
learn to do it in Python and/or through XPath (VT support would probably
help at that point). But it's a bit hard to locate the first step in
self-training for this (my minimal Python skills mostly have to do with
Raspberry Pi and musicking).
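
To make the question more concrete, the minimal sketch I imagine uses
requests and BeautifulSoup rather than the dedicated extractors above
(the URL list and output filenames are just placeholders):

import requests
from bs4 import BeautifulSoup

def extract_main_text(url):
    """Fetch a page and return its text with the obvious chrome removed."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # Drop the elements that usually carry navigation and boilerplate.
    for tag in soup(["nav", "header", "footer", "aside",
                     "script", "style", "form"]):
        tag.decompose()
    # Prefer <main> or <article> when the site marks its core content.
    main = soup.find("main") or soup.find("article") or soup.body
    return main.get_text(separator=" ", strip=True) if main else ""

# Write each page to its own plain-text file, to upload as a VT corpus.
page_urls = ["https://example.org/about"]  # placeholder list
for i, url in enumerate(page_urls):
    with open("page_{}.txt".format(i), "w", encoding="utf-8") as out:
        out.write(extract_main_text(url))

The XPath route would presumably do the same kind of filtering (say,
selecting //main or //article nodes), but I haven't gotten that far.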
We might actually skip this whole microproject if it requires too much
effort. I just thought it could be a neat opportunity to bring DH (and
humanism) to our work.
So any insight would be greatly appreciated.
Learning Pathways Strategist, Global Cybersecurity Resource, Carleton
Part-Time Professor, School of Sociological and Anthropological Studies,
University of Ottawa