[Humanist] 24.236 JSTOR and diacritics?
Humanist Discussion Group
willard.mccarty at mccarty.org.uk
Tue Aug 3 22:38:33 CEST 2010
Humanist Discussion Group, Vol. 24, No. 236.
Centre for Computing in the Humanities, King's College London
Submit to: humanist at lists.digitalhumanities.org
Date: Tue, 3 Aug 2010 12:56:30 +0000
From: Peter Batke <batke_p at hotmail.com>
Subject: jstor and diacritics
Could someone please help me with diacritics and jstor?
A search of Humanist yielded only random references to jstor and diacritics. A general search of Google did show that Endnote has had problems with exporting jstor references; it seems Endnote has written a filter. The search "jstor diacritics" is complicated by the fact that there is a journal of that name in jstor.
Here is the background: I have been using jstor for many years, generally only to read or print the articles.
Since my retirement and subsequent loss of routine jstor access I have been using a public library to e-mail
the articles to myself as attachments.
Here is the problem: in the course of handling the PDFs, I marked a page and cut and pasted the content to Textpad - just to see what would happen. To my surprise the ocr'd words showed up on the Textpad page. Great. To my even greater surprise the European diacritics seem not to have been enabled during ocr. The German word "grösser" showed up as "gr6sser".
I find this hard to believe and hope someone can tell me I am having a bad dream or that I don't understand something.
In the course of my recent project on Google Books (google: batke google books - for a free download - or you can buy it from Amazon) I have had to evaluate much critique of Google for bad ocr. There are reasons for bad ocr, many reasons, and not everybody understands them - which is fine. However, for one of the parade projects of digital humanities not to have enabled diacritics during ocr seems almost hard to grasp.
Before I succumb to outrage, which I am prepared to do, let me ask if this is just an example of pardonable bad ocr, or are diacritics really not enabled?
I was poking around in old German periodicals, e.g. "Die Welt des Islams" and "Oriens", looking for Babinger, Wittek etc., whom I found, but no umlauts - lots of 6s and double i's.
Some searches of actual words with diacritics e.g. größer yielded mixed results - as though the search engine were programmed to normalize search strings away from diacritics.
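To illustrate what "normalizing search strings away from diacritics" could look like, here is a minimal Python sketch. It is only an assumption about the general technique, not a description of how jstor or Google actually index text; the function names fold_diacritics and search_key are made up for this example.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def search_key(text: str) -> str:
    """Hypothetical index key: diacritics removed, then case-folded.

    casefold() also maps the German sharp s to "ss".
    """
    return fold_diacritics(text).casefold()


if __name__ == "__main__":
    for word in ["größer", "grösser", "Böll", "boll"]:
        print(word, "->", search_key(word))
```

Under this kind of folding, "größer" and "grösser" share one key, and "Böll" becomes indistinguishable from "boll", which would be consistent with the mixed results described above.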
To give everyone an example that will demonstrate the problem: do a Google search of - jstor b6ll - you will get jstor references to Heinrich Böll. Do a search for - jstor böll - and you get randomness mostly about "boll weevils."
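As a possible workaround when searching text ocr'd this way, one could expand a query into its likely misread spellings. The Python sketch below is hypothetical: the confusion table is an assumption, with only the ö to 6 substitution attested by the examples here ("gr6sser", "b6ll"), and ü to ii guessed from the "double i's". It handles lowercase letters only.

```python
from itertools import product

# Assumed confusion table. Only "ö" -> "6" is attested in this post;
# "ü" -> "ii" is a guess based on the "double i's" observation.
OCR_CONFUSIONS = {
    "ö": ["ö", "6"],
    "ü": ["ü", "ii"],
}


def ocr_variants(word: str) -> list[str]:
    """Return the word plus every spelling reachable via the table above."""
    choices = [OCR_CONFUSIONS.get(ch, [ch]) for ch in word]
    return ["".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    for query in ["böll", "grösser"]:
        print(query, "->", ocr_variants(query))
```

Each variant could then be searched in turn, e.g. both "jstor böll" and "jstor b6ll".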
Any clarification would be appreciated, or just shoot me when you see me. cheers, Peter Batke