[Humanist] 23.455 into the entrails of PDF

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Tue Nov 24 07:30:08 CET 2009


                 Humanist Discussion Group, Vol. 23, No. 455.
         Centre for Computing in the Humanities, King's College London
                       www.digitalhumanities.org/humanist
                Submit to: humanist at lists.digitalhumanities.org

  [1]   From:    Thomas Crombez <Thomas.Crombez at ua.ac.be>                  (60)
        Subject: Re: [Humanist] 23.451 playing with PDF guts?

  [2]   From:    Richard Lewis <richard.lewis at gold.ac.uk>                  (60)
        Subject: Re: [Humanist] 23.451 playing with PDF guts?

  [3]   From:    Elli Mylonas <elli_mylonas at brown.edu>                     (73)
        Subject: Re: [Humanist] 23.451 playing with PDF guts?

  [4]   From:    Stéfan Sinclair <sgsinclair at gmail.com>                   (31)
        Subject: Re: Playing with PDF guts?

  [5]   From:    Hugh Cayless <hcayless at email.unc.edu>                     (66)
        Subject: Re: [Humanist] 23.451 playing with PDF guts?

  [6]   From:    James Smith <jgsmith at tamu.edu>                            (62)
        Subject: Re: [Humanist] 23.451 playing with PDF guts?


--[1]------------------------------------------------------------------------
        Date: Mon, 23 Nov 2009 09:46:14 +0100
        From: Thomas Crombez <Thomas.Crombez at ua.ac.be>
        Subject: Re: [Humanist] 23.451 playing with PDF guts?
        In-Reply-To: <20091123073436.0AE5B3A465 at woodward.joyent.us>

Dear Vika,
If you know a little Python scripting (or are willing to learn it; it is quite fun and very accessible), you could look into the text layer of your PDF files using the module pyPdf (see http://pybrary.net/pyPdf/). It lets you 'grab' and modify pages, page numbers, text contents... from PDF files. A workflow specifically for editing text may be found here: http://code.activestate.com/recipes/511465/
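
As a rough illustration of that route, here is a minimal sketch. It assumes the third-party package pypdf (a present-day successor to pyPdf, with a different API from the one linked above) and uses hypothetical helper names, read_pages and pages_to_xml. It pulls one text string per page and wraps the result in a bare TEI-like body with a page break marker before each page. The paragraph splitting on blank lines is a guess at what extracted text looks like, not something any library guarantees.

```python
from xml.etree import ElementTree as ET


def read_pages(pdf_path):
    """Return one extracted text string per page.

    Needs the third-party pypdf package; it is imported lazily so that
    the rest of this sketch runs without it.
    """
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


def pages_to_xml(pages):
    """Wrap page texts in a TEI-like <body>, one <pb/> per page."""
    body = ET.Element("body")
    for number, text in enumerate(pages, start=1):
        # Mark the start of each page, since PDF text arrives page by page.
        ET.SubElement(body, "pb", n=str(number))
        # Assumption: blank lines separate paragraphs in the extracted text.
        for block in text.split("\n\n"):
            if block.strip():
                paragraph = ET.SubElement(body, "p")
                paragraph.text = " ".join(block.split())
    return ET.tostring(body, encoding="unicode")
```

With pypdf installed, pages_to_xml(read_pages("scan.pdf")) would give a starting point for hand encoding; "scan.pdf" is a placeholder file name.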

Best, Thomas Crombez

On 23-nov-2009, at 08:34, Humanist Discussion Group wrote:

> 
>                 Humanist Discussion Group, Vol. 23, No. 451.
>         Centre for Computing in the Humanities, King's College London
>                       www.digitalhumanities.org/humanist
>                Submit to: humanist at lists.digitalhumanities.org
> 
> 
> 
>        Date: Fri, 20 Nov 2009 12:48:55 -0500
>        From: Vika Zafrin <vzafrin at bu.edu>
>        Subject: Playing with PDF guts?
> 
> 
> Dear Humanists,
> 
> I had a feeling of having asked this question here before, but can't find it
> in my email archives; please forgive any duplication.
> 
> I'm having a persistent desire to semantically encode text embedded in PDF
> files, particularly OCR'd files.  There's got to be a... layer?... in that
> format where it's just the text.  Or perhaps text with styling.  I'd like to
> get to that layer, be able to *see* it, and then figure out how to (for
> example) encode it in XML, the old-fashioned way,* TEI-like or similar.  I
> figure that the encoding would then need to be in a different layer, or
> part, of the file, but don't actually know what its structure is, and so
> will stop talking.
> 
> Haven't found any applications that allow me to see just the text part.
> Googling "pdf structure" and "pdf guts" yields discouraging results.
> Acrobat... I can't find any features helpful in this quest.
> 
> Any tool-oriented advice?  Further, if this were possible, is it something
> you or scholars you know would be interested in using?
> 
> Many thanks in advance,
> -Vika
> 
> *Feels kind of homey to call anything digital humanists do old-fashioned.
> 
> -- 
> Vika Zafrin
> Digital Collections and Computing Support Librarian
> Boston University School of Theology
> 745 Commonwealth Avenue
> Boston, MA 02215
> 617.353.1317


--[2]------------------------------------------------------------------------
        Date: Mon, 23 Nov 2009 09:33:38 +0000
        From: Richard Lewis <richard.lewis at gold.ac.uk>
        Subject: Re: [Humanist] 23.451 playing with PDF guts?
        In-Reply-To: <20091123073436.0AE5B3A465 at woodward.joyent.us>


I don't know any of the details of how PDFs work, and so can't help
you with encoding information in them. But the XPDF package includes
several useful tools for extracting text from PDF files *where the
text is available*.  http://www.foolabs.com/xpdf/  As I understand it,
you're right in your description of the PDF file format as being
layered. It's quite possible (and JSTOR scans provide a good example)
for a PDF to contain both a picture of some text, along with real
encoded text. And my preferred PDF viewer (Evince) seems to suggest
that the encoded text may include some layout information as Evince is
able to search for and highlight strings in JSTOR PDFs.
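
A minimal sketch of driving one of those tools from a script, assuming the pdftotext program from the XPDF package is installed and on the PATH. The helper names pdftotext_command and extract_text are hypothetical; the -layout, -enc, -f and -l options are pdftotext's own, and "-" as the output name sends the text to standard output.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, first_page=None, last_page=None, keep_layout=True):
    """Build the argument list for a pdftotext call that writes to stdout."""
    command = ["pdftotext"]
    if keep_layout:
        # Try to preserve the physical layout of the page in the output.
        command.append("-layout")
    command += ["-enc", "UTF-8"]
    if first_page is not None:
        command += ["-f", str(first_page)]
    if last_page is not None:
        command += ["-l", str(last_page)]
    command += [pdf_path, "-"]
    return command


def extract_text(pdf_path, **options):
    """Run pdftotext and return its output, or None if it is not installed."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(
        pdftotext_command(pdf_path, **options),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

For a PDF that holds only page images, the output will be empty or nearly so, which is a quick way to tell whether the encoded text is there at all.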

Also, pdftk provides some similar facilities to
XPDF.  http://www.accesspdf.com/pdftk/  Both packages are available on
POSIX and non-POSIX platforms.

Where the PDF is just an image of the text, I've had some success with
OCRopus (http://code.google.com/p/ocropus/). I scripted the extraction of PNG
images of each page of the PDFs and the passing of those images to
OCRopus to generate HTML which includes layout information.
-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Richard Lewis
ISMS, Computing
Goldsmiths, University of London
Tel: +44 (0)20 7078 5134
Skype: richardjlewis
JID: ironchicken at jabber.earth.li
http://www.richard-lewis.me.uk/


--[3]------------------------------------------------------------------------
        Date: Mon, 23 Nov 2009 08:13:01 -0500
        From: Elli Mylonas <elli_mylonas at brown.edu>
        Subject: Re: [Humanist] 23.451 playing with PDF guts?
        In-Reply-To: <20091123073436.0AE5B3A465 at woodward.joyent.us>

Vika:

http://pdfbox.apache.org/ It's a java library.

   --elli

[Elli Mylonas
  Center for Digital Scholarship
  Brown University Library]

--[4]------------------------------------------------------------------------
        Date: Mon, 23 Nov 2009 09:55:01 -0500
        From: Stéfan Sinclair <sgsinclair at gmail.com>
        Subject: Re: Playing with PDF guts?
        In-Reply-To: <20091123073436.0AE5B3A465 at woodward.joyent.us>

Dear Vika,

You ask about working with the text and structure of PDF. I think you
have in mind modifying the file and saving changes back into the PDF,
but if it can be useful to you to just extract the text, Voyeur Tools
can help (though of course there are many other solutions for just
ripping text out of PDFs):

1) go to http://voyeurtools.org/ and add your PDF text(s) (see the
"Loading Texts" video at http://hermeneuti.ca/voyeur for more info)

2) click on the "Corpus" tab in the panel on the right

3) click on the document you wish to export

4) click the export icon (the little diskette in the same Corpus panel)

5) choose DHQAuthor (XML)

Your mileage will vary, but theoretically you should get somewhat
useful paragraphs at least. I'm not sure why I've only put DHQAuthor
for now, but I'll add HTML and text (and maybe simple TEI) for the
next release. There are no doubt easier and faster ways to extract
text from PDF, but there may not be any others that allow you to do
some text analysis while you're at it ;-)

Stéfan

-- 
[Please do not reply to this message as I use this address for
communication that is susceptible to spambots. My regular email
address starts with my user handle sgs and uses the domain name
mcmaster.ca]

--

Dr. Stéfan Sinclair, Multimedia, McMaster University
Phone: 905.525.9140 x23930; Fax: 905.527.6793
Address:
    TSH-328, Communication Studies & Multimedia
    Hamilton, Ontario, Canada L8S 4M2
http://stefansinclair.name/



--[5]------------------------------------------------------------------------
        Date: Mon, 23 Nov 2009 11:09:40 -0500
        From: Hugh Cayless <hcayless at email.unc.edu>
        Subject: Re: [Humanist] 23.451 playing with PDF guts?
        In-Reply-To: <20091123073436.0AE5B3A465 at woodward.joyent.us>

PDF is a swamp.  There are loads of tools for working with it; you just need to know what you're getting into and watch where you step.

http://en.wikipedia.org/wiki/List_of_PDF_software has a list of tools. I've personally used Ghostscript (http://www.ghostscript.com/), Xpdf (http://www.foolabs.com/xpdf/), iText (http://www.lowagie.com/iText/), and PDFBox (http://pdfbox.apache.org/) to mess around with PDFs.

One thing to consider is that PDF is (not surprisingly) page-oriented, which means your encoded text would probably need to be too.  There are other considerations with PDF text: the text in a PDF may be displayed, in which case each character has a glyph, which is what's actually painted on the screen (if you're lucky, each glyph has a character, but let's not go there); or it may be hidden, and what's displayed on the page is an image, with uncorrected OCR as part of the page object for searching purposes.  The latter can be used in pretty sophisticated ways, and I don't have a deep understanding of it.  
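
To make the displayed/hidden distinction concrete, here is a minimal sketch using only the Python standard library. It relies on one detail of the format: inside a page's content stream, the operator "Tr" sets the text rendering mode, and mode 3 means the text is neither filled nor stroked, i.e. invisible, which is how OCR text is commonly placed over a page image. Everything else is a simplifying assumption: the stream is Flate-compressed, strings are plain literals shown with "Tj", and the bytes map directly to characters (the glyph-versus-character problem is ignored, as are "TJ" arrays, hex strings, and graphics-state save/restore). The name classify_text and the SAMPLE stream are made up for illustration.

```python
import re
import zlib

# Either "<digit> Tr" (set text rendering mode) or "(literal string) Tj" (show text).
TOKEN = re.compile(r"(\d)\s+Tr\b|\(((?:\\.|[^\\()])*)\)\s*Tj")


def classify_text(stream_bytes):
    """List (text, 'hidden' | 'visible') pairs from a compressed content stream."""
    content = zlib.decompress(stream_bytes).decode("latin-1")
    mode = 0  # default rendering mode: fill, i.e. visible
    found = []
    for match in TOKEN.finditer(content):
        if match.group(1) is not None:
            # The rendering mode persists until it is set again.
            mode = int(match.group(1))
        else:
            found.append((match.group(2), "hidden" if mode == 3 else "visible"))
    return found


# A hand-made stream: one invisible OCR string, then one visible string.
SAMPLE = zlib.compress(
    b"BT /F1 12 Tf 3 Tr 72 720 Td (uncorrected OCR) Tj ET\n"
    b"BT /F1 12 Tf 0 Tr 72 700 Td (visible caption) Tj ET\n"
)

if __name__ == "__main__":
    for text, kind in classify_text(SAMPLE):
        print(kind, text)
```

On real files the libraries listed above do this decoding far more robustly; the sketch is only meant to show where the two kinds of text differ inside the page object.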

Information about PDF can be found here: http://www.adobe.com/devnet/pdf/.  The PDF spec itself is 1310 pages, which may help explain why your search results were discouraging.

Hope this helps,

Hugh

/**
 * Hugh A. Cayless, Ph.D.
 * NYU Digital Library Technology Services
 * http://papyri.info
 */



--[6]------------------------------------------------------------------------
        Date: Mon, 23 Nov 2009 10:58:56 -0600
        From: James Smith <jgsmith at tamu.edu>
        Subject: Re: [Humanist] 23.451 playing with PDF guts?
        In-Reply-To: <20091123073436.0AE5B3A465 at woodward.joyent.us>

   
There are several low-level options, but PDF in general is not as simple 
as a TEI document.  It seems to share more personality with TeX/LaTeX.  
If you want to use a library to manage PDF file content, then I think 
Wikipedia has a good enough description of how PDF documents are 
structured: http://en.wikipedia.org/wiki/PDF .

There are some libraries for manipulating PDF documents.  In Perl, 
PDF::API2 seems to be the most recent effort: 
http://search.cpan.org/~areibens/PDF-API2/ .  That's the one I'd try 
first if creating something in Perl.  The Ruby libraries seem to be 
focused on creating PDF documents instead of reading them.  Not sure 
what's available in Python, Java, or other languages/platforms.

If nothing else, looking at the Wikipedia article and the general 
abilities that the libraries make available might help you know if it is 
possible to embed a semantic encoding within a PDF document.  My initial 
reaction is that it might not be possible, or, if it is possible, it 
won't be accessible without special code in PDF readers (or encoding it 
in a way that will work with current readers -- something that might be 
outside the scope of the PDF specification).

-- Jim




