[Humanist] 24.250 JSTOR and diacritics
Humanist Discussion Group
willard.mccarty at mccarty.org.uk
Wed Aug 11 22:44:04 CEST 2010
Humanist Discussion Group, Vol. 24, No. 250.
Centre for Computing in the Humanities, King's College London
Submit to: humanist at lists.digitalhumanities.org
Date: Wed, 11 Aug 2010 15:38:07 +0000
From: Peter Batke <batke_p at hotmail.com>
Subject: jstor and diacritics
Thanks to the intrepid few who have expressed interest in the problem of shaky ocr in jstor.
I have resisted stimulating interest in the topic by writing things like: "It seems humanist subscribers are more interested in creating markup for a glyph that occurs one time in the history of written expression than in chronic misrepresentation of uncounted glyphs of European diacritics in one of the premier large scale digitization projects - making searches of scholarship a crap shoot." That would have made the high security cages spring open no doubt. But I rejected the temptation to write things like the above. Yet we should consider the implications of lavishing untold hours on marking up 80 manuscripts of some Middle English narrative while ceding the electronic representation of the collected scholarship of the last two centuries to a non-profit institution that could be construed as an efficient manager of revenue flow from institutions to publishers.
Everyone is of course operating in the public interest, to be sure. My interest as one of the locked out retired is of course of no interest to anyone - as are the interests of all those not part of the 3000 institutions paying up their subscription.
Jstor is now 15 years old and perhaps it is time to renegotiate the deal. In my recent work on large scale digitization (see Google Book Search and its Critics), I have dared to suggest that some of the toll barriers should fall. The Connecticut Turnpike removed its toll booths after the bonds were paid off. We should be suspicious of non-profit megaliths who keep offering new services for a fee while really charging for access to a considerable amount of material that has been in the public domain for many decades. And doing it badly at that. There is an aroma of elitism here that I recognize from a lifetime of working at east-coast universities, the primacy of service to self.
I would like to make three points to summarize this first phase of the discussion (feel free to start a new thread). 1. Is it a pdf problem? 2. Have we lost an understanding of how ocr is done? 3. Can Google Book Search embolden us to demand free access to journals published before the Second World War?
The sparse off-line discussion as well as the even sparser discussion on the list have suggested that there is a "pdf problem" - a problem either with my use of pdf or with pdf itself. The notion that there could be a problem with jstor did not seem likely - on its face. Pdf has come a long way since 1995 when jstor was started, and I wonder if the practice of layering a text layer over the graphics layer, a layer of the text of individual words tied to their location on the page that can be marked and copied, is still best practice. One might consider that an electronic representation of the whole text of an article might be a better delivery vehicle. I am not insensitive to the potential problems, but delivering page images for eyeball scanning is not where this is eventually going, eta soon. If jstor will not realize that this data has to be made available for mining (and I don't know where the discussion stands), I could imagine redoing at least parts of the scholarship of the 19th century in short order. Google has proven it can be done. However that may be worked out in the future, the problems with "Schiitz" and with "all6" are standard ocr problems that point to a careless operator and questionable quality control. Pdf may well be improved as well in time.
Point two. Why do folks not recognize the signs of bad ocr, or seem to mind? Perhaps there has been a generational shift, and certain insights that those of us who did ocr in the 80's gained, laboriously and with much trial and error, have been lost. I would NEVER sign off on an ocr job without first looking at an alphabetical word list of the whole job. I would look at the beginning and at the end of the list. That is where errors like to congregate. I would also check the beginning and end of each letter, ditto. Of course you check for numbers mixed in with letters. It seems shocking that jstor appears not to have such utilities at its disposal. Of course all this can be fixed and should be fixed soon. A couple of perl programs could go through jstor in a week and write out a list of specific replace candidates that could be implemented in another week. European diacritics must ALWAYS be enabled even in English-only texts to catch references in footnotes. I would create an exemption for "really large scale digitization" (Google Books) that works with quick and dirty ocr in an initial phase.
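For those who have not seen such a word-list check, here is a minimal sketch of the idea. It is in Python rather than perl, and the function names (word_list, mixed_tokens, letter_edges) and the sample sentence are my own illustrations, not anything jstor or any ocr package provides. It only surfaces candidates for a human eye; it does not decide what the correct reading is.

```python
import re
from collections import Counter

# Runs of letters and digits; Unicode-aware, so "Schütz" stays one token.
TOKEN_RE = re.compile(r"[^\W_]+")

def word_list(text):
    """Return (word, count) pairs in alphabetical order, ignoring case."""
    counts = Counter(TOKEN_RE.findall(text))
    return sorted(counts.items(), key=lambda kv: (kv[0].casefold(), kv[0]))

def mixed_tokens(words):
    """Tokens that contain both letters and digits, such as 'all6'."""
    return [
        w for w, _ in words
        if any(c.isalpha() for c in w) and any(c.isdigit() for c in w)
    ]

def letter_edges(words, n=3):
    """First and last n entries under each initial character of the list."""
    groups = {}
    for w, _ in words:
        groups.setdefault(w[0].casefold(), []).append(w)
    return {k: (v[:n], v[-n:]) for k, v in groups.items()}

if __name__ == "__main__":
    sample = "Schiitz wrote all6 lines; Schütz wrote all lines in 1885."
    words = word_list(sample)
    print("head of list:", words[:3])
    print("tail of list:", words[-3:])
    print("letters mixed with digits:", mixed_tokens(words))
    print("edges per letter:", letter_edges(words))
```

Run over a whole job, the head and tail of the list and of each letter are what one would read by eye; the mixed letter-and-digit tokens are the obvious replace candidates. A misreading like "Schiitz" has no digit in it, so it only shows up when someone actually reads the list.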
Finally, the fact that Louisiana is pulling the plug on jstor is sad. It is sad because they will have pulled the plug on any number of other for-fee services in the delivery of electronic text before getting to jstor. This widens the gap between those who have and those who have not. The pool of the have-nots is already large, I know, I am in it. So let us consider the notion that electronic journals that are out of copyright should really be free to all, to quote Prof. Darnton's favorite phrase from the founding fathers. I would like to see the delivery of current journals be separated from the archive. I would like to have the whole notion of taking public domain materials and delivering them for a charge under the guise of adding value or convenience be reexamined. Again Google is showing the way. OK, if it is protected by law, then fees will have to be paid at a fair schedule. If it is in the public domain, it is as free as the book in the library was intended to be by George, John, Ben and Tommy and a couple of James'.
I am not finished, but I have reached the limits for a posting. cheers, Peter [currently wirelessly at the beach]