[Humanist] 26.925 a dirty historical dataset?
Humanist Discussion Group
willard.mccarty at mccarty.org.uk
Sat Mar 30 09:42:12 CET 2013
Humanist Discussion Group, Vol. 26, No. 925.
Department of Digital Humanities, King's College London
Submit to: humanist at lists.digitalhumanities.org
Date: Fri, 29 Mar 2013 10:32:00 +0100
From: Seth van Hooland <svhoolan at ulb.ac.be>
Subject: dirty historical datasets
I'm currently preparing a lesson on data quality for http://programminghistorian.org/ to demonstrate how historians can use data profiling techniques to diagnose and improve the quality of source materials.
If you are aware of a particularly interesting historical dataset which could be used as a case study for this lesson, please get in touch. Two conditions: 1) the bigger the dataset, the better, and 2) the dataset should be available under the http://creativecommons.org/licenses/by/2.0/ license.
A concrete example: by using faceting and clustering techniques, researchers interested in analyzing the different ships described in the http://www.slavevoyages.org database can group together records that refer to the same ship but are treated as different because of spelling or character encoding differences. The screenshot available at http://homepages.ulb.ac.be/~svhoolan/clusters.jpg illustrates this approach.
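To make the idea a little more tangible, here is a minimal sketch of faceting (counting how often each distinct value occurs) and of one common clustering approach, key collision on a normalized "fingerprint". This is only an illustration under assumptions: the ship names are invented for the example and not taken from the database, the function names are hypothetical, and the tool behind the screenshot may well use different methods.

```python
import re
import unicodedata
from collections import Counter, defaultdict


def fingerprint(name: str) -> str:
    """Reduce a name to a key that ignores case, accents, punctuation and word order."""
    # Decompose accented characters, then drop the combining marks (e.g. "ç" -> "c").
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase, turn punctuation into spaces, and sort the unique tokens.
    tokens = re.sub(r"[^\w\s]", " ", unaccented.lower()).split()
    return " ".join(sorted(set(tokens)))


def facet(values):
    """Count how often each distinct raw value occurs."""
    return Counter(values)


def cluster(values):
    """Group distinct values whose fingerprints collide; keep groups of two or more."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical ship names, invented for this illustration only.
ship_names = [
    "Nossa Senhora da Conceição",
    "Nossa Senhora da Conceicao",
    "nossa senhora da conceição",
    "Conceição, Nossa Senhora da",
    "Elizabeth",
    "Elisabeth",
]

if __name__ == "__main__":
    for value, count in facet(ship_names).most_common():
        print(count, value)
    for group in cluster(ship_names):
        print("Possible duplicates:", group)
```

With this sample, the four "Conceição" variants end up in one cluster, while "Elizabeth" and "Elisabeth" do not, because a fingerprint only removes differences in case, accents, punctuation and word order. Genuine spelling variants would need an approximate method such as edit distance, and text damaged by a wrong character encoding would have to be repaired before clustering.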
Seth van Hooland
Chair of the Master in Information and Communication Science and Technology (MaSTIC)
Université Libre de Bruxelles
Av. F.D. Roosevelt, 50 CP 123 | 1050 Bruxelles
0032 2 650 4765