[Humanist] 24.238 JSTOR and diacritics

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Thu Aug 5 00:18:10 CEST 2010


                 Humanist Discussion Group, Vol. 24, No. 238.
         Centre for Computing in the Humanities, King's College London
                       www.digitalhumanities.org/humanist
                Submit to: humanist at lists.digitalhumanities.org



        Date: Wed, 4 Aug 2010 10:37:12 +0100
        From: "Stephen Woodruff" <s.woodruff at arts.gla.ac.uk>
        Subject: RE: [Humanist] 24.236 JSTOR and diacritics?
        In-Reply-To: <20100803203833.13AD02A21D at woodward.joyent.us>


> re Subject: jstor and diacritics

The problem is probably PDF not TextPad. (I assume you tried pasting
into something else, like a word processor?)
There are many ways of creating and encoding a PDF file, and not all
result in text which can be copied and pasted if the text includes more
than standard ASCII characters. Normal word processors hold an
internationally accepted numerical representation of each letter plus a
note of its font, size, colour and so on. So you can search for an "a"
without caring whether it's in Arial or Times, red or italic, and you can
copy that numerical representation to another application, even if it
doesn't understand colour or have the same fonts.
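
To make that concrete, here is a rough sketch in Python of the word
processor model. The Run class and the helper names are my own invention
for illustration, not any real word processor's format: the point is only
that the character codes live apart from the font and colour notes.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """A stretch of text plus a note of how it should look (hypothetical model)."""
    text: str               # the character codes themselves (Unicode)
    font: str = "Times"
    colour: str = "black"
    italic: bool = False


def plain_text(runs):
    """Throw away the formatting notes and keep only the character codes."""
    return "".join(run.text for run in runs)


def find(runs, needle):
    """Search without caring about font, colour or style."""
    return plain_text(runs).find(needle)


doc = [
    Run("Zur ", font="Arial"),
    Run("Überlieferung", font="Times", colour="red", italic=True),
]

# Copying to another application only needs plain_text(doc);
# the diacritic survives because it is stored as a character code.
copied = plain_text(doc)
position = find(doc, "Ü")
```

Here the search finds the "Ü" whether the run holding it is red, italic
or set in Times, and the copied string carries the diacritic intact.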

PDF doesn't always work like that. Some encodings are analogous to what
a typical word processor would use, some are not: they store glyphs,
effectively pictures of the individual letters, and have a table to
convert back between those and the character codes needed by a
copy-paste operation. It's that conversion back that can go wrong: you
can read the PDF files and print them because all your eyes and the
printer need are the shapes, but if they have been created badly you
cannot reliably extract the text.
(I'm trying hard not to start complaining about the use of PDF, which is
a PAGE description language not a TEXT description language, in the
academic world.)
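
As a toy illustration of the glyph-table problem, again in Python: the
glyph numbers and both tables below are made up, and a real PDF keeps
this information in its font encoding or a ToUnicode table rather than
in anything this simple. It does show why a file can look perfect on
screen and still paste as rubbish.

```python
# Hypothetical glyph numbers as they might sit on the page for "naïve".
# The viewer and the printer only need these to pick the right shapes.
page_glyphs = [1, 2, 3, 4, 5]

# A well-made file carries a complete glyph-to-character table.
good_table = {1: "n", 2: "a", 3: "ï", 4: "v", 5: "e"}

# A badly made file leaves the accented letter out of the table.
bad_table = {1: "n", 2: "a", 4: "v", 5: "e"}

REPLACEMENT = "\ufffd"  # Unicode "replacement character"


def extract_text(glyphs, table):
    """Convert glyph numbers back to characters, as copy-paste must do."""
    return "".join(table.get(glyph, REPLACEMENT) for glyph in glyphs)


good_copy = extract_text(page_glyphs, good_table)
bad_copy = extract_text(page_glyphs, bad_table)
```

With the complete table the word comes back whole; with the incomplete
one the "ï" is lost, although both files would display and print
identically.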

James King of Adobe explains things well in his blog at
http://blogs.adobe.com/insidepdf/2008/07/text_content_in_pdf_files.html
and justifies it by explaining PDF's purpose.  For more technical detail
on PDF font encoding see
http://www.4xpdf.com/2010/03/technical-background-to-pdf-font-options/

> Any clarification would be appreciated, or just shoot 
> me when you see me. cheers, Peter Batke

Let's shoot PDF instead.




