[Humanist] 26.925 a dirty historical dataset?

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Sat Mar 30 09:42:12 CET 2013

                 Humanist Discussion Group, Vol. 26, No. 925.
            Department of Digital Humanities, King's College London
                Submit to: humanist at lists.digitalhumanities.org

        Date: Fri, 29 Mar 2013 10:32:00 +0100
        From: Seth van Hooland <svhoolan at ulb.ac.be>
        Subject: dirty historical datasets

Dear all,

I'm currently preparing a lesson on the topic of data quality for http://programminghistorian.org/ in order to demonstrate how historians can use data profiling techniques to diagnose and enhance the quality of source materials. 

If you are aware of a particularly interesting historical dataset which could be used as a case study for this lesson, please get in touch. Two conditions: 1) the bigger the dataset, the better and 2) the dataset should be available under the http://creativecommons.org/licenses/by/2.0/ license.

A concrete example: by using faceting and clustering techniques, researchers interested in analyzing the different ships described in the http://www.slavevoyages.org database can group together entries that refer to the same entity but are treated as distinct because of spelling or character encoding differences. The screenshot available at http://homepages.ulb.ac.be/~svhoolan/clusters.jpg illustrates this approach.
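
To give a rough idea of what such clustering can look like in code, here is a minimal sketch of one common approach, key-collision ("fingerprint") clustering. It is an assumption that this resembles the method behind the screenshot; the function names and the ship names are hypothetical and invented purely for illustration, not taken from the database.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Reduce a name to a normalized key so that spelling variants collide."""
    # Decompose accented characters and drop the combining marks (e.g. ç -> c).
    decomposed = unicodedata.normalize("NFKD", value)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase, turn punctuation into spaces, then sort the unique tokens.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = unaccented.lower().translate(table).split()
    return " ".join(sorted(set(tokens)))


def cluster(values):
    """Group values that share a fingerprint; keep only groups with variants."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(variants)
            for key, variants in groups.items()
            if len(variants) > 1}


# Hypothetical ship names, made up for this sketch.
ship_names = [
    "Nossa Senhora da Conceição",
    "Nossa Senhora da Conceicao",
    "nossa senhora da conceicao",
    "Conceição, Nossa Senhora da",
    "Mary Ann",
    "Mary-Ann",
    "Hope",
]

clusters = cluster(ship_names)
for key, variants in clusters.items():
    print(key, "<-", variants)
```

Each cluster is only a candidate for merging: a historian would still need to check that the variants really denote the same ship before normalizing them.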

Kind regards,

Seth van Hooland
Chair of the Master in Information and Communication Science and Technology (MaSTIC)
Université Libre de Bruxelles
Av. F.D. Roosevelt, 50 CP 123  | 1050 Bruxelles
0032 2 650 4765
Office: DC11.102
