[Humanist] 23.451 playing with PDF guts?

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Mon Nov 23 08:34:36 CET 2009

                 Humanist Discussion Group, Vol. 23, No. 451.
         Centre for Computing in the Humanities, King's College London
                Submit to: humanist at lists.digitalhumanities.org

        Date: Fri, 20 Nov 2009 12:48:55 -0500
        From: Vika Zafrin <vzafrin at bu.edu>
        Subject: Playing with PDF guts?

Dear Humanists,

I had a feeling of having asked this question here before, but can't find it
in my email archives; please forgive any duplication.

I'm having a persistent desire to semantically encode text embedded in PDF
files, particularly OCR'd files.  There's got to be a... layer?... in that
format where it's just the text.  Or perhaps text with styling.  I'd like to
get to that layer, be able to *see* it, and then figure out how to (for
example) encode it in XML, the old-fashioned way,* TEI-like or similar.  I
figure that the encoding would then need to be in a different layer, or
part, of the file, but don't actually know what its structure is, and so
will stop talking.

I haven't found any applications that allow me to see just the text
part.  Googling "pdf structure" and "pdf guts" yields discouraging results.
Acrobat... I can't find any features helpful in this quest.

Any tool-oriented advice?  Further, if this were possible, is it something
you or scholars you know would be interested in using?

Many thanks in advance,

*Feels kind of homey to call anything digital humanists do old-fashioned.

Vika Zafrin
Digital Collections and Computing Support Librarian
Boston University School of Theology
745 Commonwealth Avenue
Boston, MA 02215
