[Humanist] 28.171 text for text mining

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Wed Jul 2 23:42:11 CEST 2014


                 Humanist Discussion Group, Vol. 28, No. 171.
            Department of Digital Humanities, King's College London
                       www.digitalhumanities.org/humanist
                Submit to: humanist at lists.digitalhumanities.org

  [1]   From:    Alexander O'Connor <Alex.OConnor at scss.tcd.ie>             (46)
        Subject: Re:  28.168 text for text mining?

  [2]   From:    Desmond Schmidt <desmond.allan.schmidt at gmail.com>         (63)
        Subject: Re:  28.168 text for text mining?

  [3]   From:    "Robert A. Amsler" <amsler at cs.utexas.edu>                 (54)
        Subject: drew at niu.edu

  [4]   From:    "Dave Postles" <davep at davelinux.info>                      (6)
        Subject: Re:  28.168 text for text mining?

  [5]   From:    Patrick Durusau <patrick at durusau.net>                     (62)
        Subject: Re:  28.168 text for text mining?


--[1]------------------------------------------------------------------------
        Date: Tue, 1 Jul 2014 21:54:09 +0100
        From: Alexander O'Connor <Alex.OConnor at scss.tcd.ie>
        Subject: Re:  28.168 text for text mining?
        In-Reply-To: <20140701203626.22B10626E at digitalhumanities.org>


You might ask them for the transcriptions or plaintext versions. It would be wise also to discover if that was performed manually, by edited optical character recognition or by purely automated means. 

--
Dr. Alexander O'Connor
Research Fellow CNGL
KDEG, Trinity College Dublin
Ireland 

> On 1 Jul 2014, at 21:36, Humanist Discussion Group <willard.mccarty at mccarty.org.uk> wrote:
> 
>                 Humanist Discussion Group, Vol. 28, No. 168.
>            Department of Digital Humanities, King's College London
>                       www.digitalhumanities.org/humanist
>                Submit to: humanist at lists.digitalhumanities.org
> 
> 
> 
>        Date: Tue, 01 Jul 2014 14:14:34 -0500
>        From: "Drew VandeCreek" <drew at niu.edu>
>        Subject: text mining
> 
> 
> I am a historian trying to figure out how to do text mining. In this case I am working with nineteenth-century American newspapers. 
> 
> I recently contacted a library that makes a Civil War-era newspaper available in searchable format for use on (brick and mortar) site, and asked them for permission to work with materials from 1861-1865. 
> 
> After we negotiated a brief agreement setting out terms of use, they sent me the files. The problem is that they sent me a TIF-format image for every page. I had asked for the text-format versions of the files.
> 
> I am now making sure that I can be clear about what I am requesting when I follow up with them. 
> 
> It is my understanding that if a textual resource is to be searched in any effective sense, the software must work with the material in a text format. 
> 
> Thus, if the lending library presents searchable textual materials, they must have a text-format file on hand. 
> 
> Should I move forward with this assumption?
> 
> 
> Please advise. 
> 
> 
> 
> Drew E. VandeCreek
> Director of Digital Initiatives 
> 
> University Libraries
> Northern Illinois University
> DeKalb, IL 60115
> (815) 753-7179



--[2]------------------------------------------------------------------------
        Date: Wed, 2 Jul 2014 06:55:40 +1000
        From: Desmond Schmidt <desmond.allan.schmidt at gmail.com>
        Subject: Re:  28.168 text for text mining?
        In-Reply-To: <20140701203626.22B10626E at digitalhumanities.org>


Hi Drew,

my guess is that they don't have those years transcribed yet into text
format. If you compare the Trove collection at nla.gov.au, which has
even older newspapers digitised, the problem with OCR of texts from this
era becomes apparent. I heard that they used Abbyy FineReader to get a
rough text, and it does a very good job considering what they started with,
but it still needs editing. They use a crowd sourcing approach for that
which is very popular and successful. Alternatively you could use a tool
like Acrobat, which can do OCR directly within the image. I don't know if
it does a good enough job with those newspapers, though. And maybe that is
easier for you. But before you can do any text analysis you do need those
text files in some form.

Desmond Schmidt
Queensland University of Technology


--[3]------------------------------------------------------------------------
        Date: Tue, 1 Jul 2014 16:46:46 -0500
        From: "Robert A. Amsler" <amsler at cs.utexas.edu>
        Subject: drew at niu.edu
        In-Reply-To: <20140701203626.22B10626E at digitalhumanities.org>


The assumption that there is text underlying the TIF images is correct;
however, the text may not be proofed such that it is completely accurate.
One reason sites offer up TIF images instead of text is that the text,
resulting from document imaging systems, may be damaged to the point where
it isn't really that readable. They rely on there being sufficient
duplication of content words in the data to provide keyword matches.
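
A minimal sketch of that last point, assuming a plain exact-match keyword
index and an invented sample page (the text, the function names and the 0.75
similarity threshold are all illustrative assumptions, not anything a
particular library system is known to use). It shows a page remaining
findable because one of three occurrences of a word survived OCR intact:

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def exact_hits(text, keyword):
    """Count tokens identical to the keyword, as a plain keyword index would."""
    kw = keyword.lower()
    return sum(1 for tok in tokenize(text) if tok == kw)


def near_hits(text, keyword, threshold=0.75):
    """List tokens that resemble the keyword without matching it exactly.

    These are candidates for OCR-damaged spellings; the threshold is an
    arbitrary illustrative value.
    """
    kw = keyword.lower()
    return [
        tok
        for tok in tokenize(text)
        if tok != kw and SequenceMatcher(None, tok, kw).ratio() >= threshold
    ]


# Invented sample: "regiment" appears three times, twice with OCR-style damage.
page = (
    "The regirnent marched at dawn. General orders reached the regiment "
    "by noon, and the reglment made camp near the river."
)

print(exact_hits(page, "regiment"))  # 1: enough for the page to be found
print(near_hits(page, "regiment"))   # the damaged spellings a search would miss
```

The page is retrievable, but any count of how often the word occurs would be
too low, which matters more for text mining than for search.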



--[4]------------------------------------------------------------------------
        Date: Tue, 1 Jul 2014 23:26:42 +0100
        From: "Dave Postles" <davep at davelinux.info>
        Subject: Re:  28.168 text for text mining?
        In-Reply-To: <20140701203626.22B10626E at digitalhumanities.org>

A .pdf file is searchable in a rudimentary sense and can be exported
from graphics programs as an image file.  As proof of concept, I've just
imported a .pdf into GIMP and exported it as a .tif image file.

-- 
http://www.historicalresources.myzen.co.uk (research and pedagogy)
From my Trisquel Linux desktop



--[5]------------------------------------------------------------------------
        Date: Tue, 01 Jul 2014 20:39:23 -0400
        From: Patrick Durusau <patrick at durusau.net>
        Subject: Re:  28.168 text for text mining?
        In-Reply-To: <20140701203626.22B10626E at digitalhumanities.org>


Drew,


Yes, TIFF files must have OCR performed on them prior to searching.

However, that is easy for MS Windows XP or later software. One support
note on the process can be found at:
http://office.microsoft.com/en-us/help/about-indexing-text-in-tiff-and-mdi-files-HP003081236.aspx

There are any number of tiff OCR services, some for free on the web.

The library in question may have a text file or they may have embedded
the results of OCR in the TIFF files already. Would be worth asking.
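
One rough way to check the second possibility before asking, sketched under
stated assumptions: the function below (the name ascii_tags is hypothetical)
reads only the first image directory of a classic TIFF and lists the tags
whose values are stored as ASCII text. Whether, or in which tag, any given
system embeds OCR output is not known here, so an empty result proves
nothing; the demonstration file is synthetic.

```python
import struct


def ascii_tags(data):
    """Return {tag_number: text} for ASCII-typed entries in the first IFD.

    Handles classic TIFF only (magic number 42); later directories in
    multi-page files and BigTIFF are not examined.
    """
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("missing TIFF byte-order mark")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF (magic number is not 42)")
    (n_entries,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])
    found = {}
    for i in range(n_entries):
        start = ifd_offset + 2 + 12 * i  # each directory entry is 12 bytes
        tag, field_type, count = struct.unpack(
            order + "HHI", data[start:start + 8]
        )
        if field_type != 2:  # field type 2 is ASCII in TIFF 6.0
            continue
        if count <= 4:
            # Values of four bytes or fewer are stored inline in the entry.
            raw = data[start + 8:start + 8 + count]
        else:
            (value_offset,) = struct.unpack(
                order + "I", data[start + 8:start + 12]
            )
            raw = data[value_offset:value_offset + count]
        found[tag] = raw.rstrip(b"\x00").decode("latin-1")
    return found


# Synthetic one-entry TIFF: tag 270 (ImageDescription) holding a short string.
text = b"sample ocr text\x00"
tiny_tiff = (
    b"II"
    + struct.pack("<HI", 42, 8)                       # magic, first IFD offset
    + struct.pack("<H", 1)                            # one directory entry
    + struct.pack("<HHII", 270, 2, len(text), 26)     # tag, type, count, offset
    + struct.pack("<I", 0)                            # no further directories
    + text
)
print(ascii_tags(tiny_tiff))
```

For a real page, the same function could be called on the bytes read from the
file, for example ascii_tags(open(path, "rb").read()), where path is whichever
TIFF the library supplied.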

Best of luck with the project!

Patrick

-- 
Patrick Durusau
patrick at durusau.net
Technical Advisory Board, OASIS (TAB)
Co-Chair, OpenDocument Format TC (OASIS)
Editor, OpenDocument Format TC, Project Editor ISO/IEC 26300
Former Chair, V1 - US TAG to JTC 1/SC 34
Convener, JTC 1/SC 34/WG 3 (Topic Maps)
Co-Editor, ISO 13250-5 (Topic Maps)

Another Word For It (blog): http://tm.durusau.net
Homepage: http://www.durusau.net
Twitter: patrickDurusau



