[Humanist] 30.547 open source for scanning to row/column

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Wed Dec 7 11:50:08 CET 2016


                 Humanist Discussion Group, Vol. 30, No. 547.
            Department of Digital Humanities, King's College London
                       www.digitalhumanities.org/humanist
                Submit to: humanist at lists.digitalhumanities.org

  [1]   From:    Maximilian Schich <maximilian at schich.info>                (21)
        Subject: Re:  30.541 open-source for scanning to row/column format?

  [2]   From:    Desmond Schmidt <desmond.allan.schmidt at gmail.com>         (37)
        Subject: Re:  30.541 open-source for scanning to row/column format?


--[1]------------------------------------------------------------------------
        Date: Tue, 6 Dec 2016 01:25:22 -0600
        From: Maximilian Schich <maximilian at schich.info>
        Subject: Re:  30.541 open-source for scanning to row/column format?
        In-Reply-To: <20161206065338.3CA7F764 at digitalhumanities.org>


For the OCR part: 
https://opensource.com/life/15/9/open-source-extract-text-images

For the table part: Recent open spreadsheet software should cope with a 
couple of hundred thousand rows.
Alternatively, this is great: http://datascienceatthecommandline.com/

mxs

On 2016-12-06 00:53, Humanist Discussion Group wrote:
>                   Humanist Discussion Group, Vol. 30, No. 541.
>              Department of Digital Humanities, King's College London
>                         www.digitalhumanities.org/humanist
>                  Submit to: humanist at lists.digitalhumanities.org
>
>
>
>          Date: Mon, 5 Dec 2016 21:43:27 +0000
>          From: Drew VandeCreek <drew at niu.edu>
>          Subject: question
>
>
> I have a Digital Humanities student who is adept at computer programming. He wants to work with analog/print text sources that are in table format (e.g. historical population tables not published by the census bureau). His goal is to be able to take scanned images of the pages with tables of data on them and have the tables' contents accessible in row/column formats. He has seen some commercial products that can do this (ABBYY FineReader -> Excel), but wonders if there are open source or DH-friendly projects. He hopes to do this on a massive scale (1000s of pages) and then text-mine the output into his database.




--[2]------------------------------------------------------------------------
        Date: Tue, 6 Dec 2016 21:36:37 +1000
        From: Desmond Schmidt <desmond.allan.schmidt at gmail.com>
        Subject: Re:  30.541 open-source for scanning to row/column format?
        In-Reply-To: <20161206065338.3CA7F764 at digitalhumanities.org>


The only open-source OCR packages I know are Tesseract and Ocropus. But
Tesseract at least doesn't do columns. What you might try is slicing the
image of the page into strips using ImageMagick, then running the strips
through Tesseract. Then you could reassemble them page by page using a
script. ImageMagick is a command-line tool. I've used it to slice images
into halves, but if your columns are regular enough I don't see why you
can't use it to help recognise spreadsheet data:
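#!/bin/bash
# Split every .jpg in the current directory into left and right halves.
for f in *.jpg
  do
    # Writes NAME-0.jpg and NAME-1.jpg next to the original NAME.jpg.
    convert "$f" -crop 50%x100% +repage "${f%.jpg}-%d.jpg"
  done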
This splits all .jpg files into two. Adapt it to split into more columns.
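
The slice, recognise and reassemble steps could be sketched roughly as
follows. This is only an illustration under some assumptions: ImageMagick
and Tesseract are installed, every page has the same number of equally
wide columns, and each strip yields the same number of text lines (in
practice Tesseract may drop or merge rows, so the output needs checking).
The function names are made up for this example.

```shell
#!/bin/bash
# Sketch only: assumes ImageMagick ("convert") and Tesseract are installed
# and that every page has the same number of equally wide columns.

# Cut one page image into N equal vertical strips: page-0.png, page-1.png, ...
# The "Nx1@" geometry asks ImageMagick for N tiles across and 1 down.
split_columns() {
  local page=$1 ncols=$2
  convert "$page" -crop "${ncols}x1@" +repage "${page%.*}-%d.png"
}

# OCR one strip; Tesseract appends ".txt" to the output base name.
# "--psm 6" (treat the image as one uniform block of text) is the
# Tesseract 4+ spelling; older releases use "-psm 6".
ocr_strip() {
  local strip=$1
  tesseract "$strip" "${strip%.*}" --psm 6
}

# Join the per-strip text files line by line into tab-separated rows.
merge_columns() {
  paste "$@"
}

# Hypothetical driver for a single page with a given number of columns.
process_page() {
  local page=$1 ncols=$2 base=${1%.*} i
  local texts=()
  split_columns "$page" "$ncols"
  for ((i = 0; i < ncols; i++)); do
    ocr_strip "$base-$i.png"
    texts+=("$base-$i.txt")
  done
  merge_columns "${texts[@]}" > "$base.tsv"
}
```

Calling process_page on a scanned page with, say, four columns should
leave a tab-separated file beside it that a spreadsheet or database
import can read.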


