[Humanist] 27.769 characters not available in UniCode

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Thu Feb 6 08:24:27 CET 2014

                 Humanist Discussion Group, Vol. 27, No. 769.
            Department of Digital Humanities, King's College London
                Submit to: humanist at lists.digitalhumanities.org

  [1]   From:    James Cummings <James.Cummings at it.ox.ac.uk>               (29)
        Subject: Re:  27.765 characters not in UniCode

  [2]   From:    Alan Corre <corre at uwm.edu>                                 (8)
        Subject: Characters not available in Unicode

        Date: Wed, 05 Feb 2014 12:58:09 +0000
        From: James Cummings <James.Cummings at it.ox.ac.uk>
        Subject: Re:  27.765 characters not in UniCode
        In-Reply-To: <20140205090151.CB51F624A at digitalhumanities.org>

On 05/02/14 09:01, Humanist Discussion Group wrote:
> Dear Desmond (and all),
> For a list of characters that aren't currently in Unicode, see the Medieval
> Unicode Font Initiative (MUFI), which proposes new characters for inclusion
> in Unicode.  This includes things like common scribal abbreviations, some
> of which made the transition to print. It does, however, also include
> ligatures and accents.  http://www.mufi.info

As part of the ENRICH project I helped to create a 'gaiji' bank 
at http://www.manuscriptorium.com/apps/gbank/ which was directly 
derived from the MUFI characters using the private-use-area at 
the time of the project. In each instance it gives a sample of 
XML encoding of the non-Unicode character and the XML source of a 
<char> element which could be included in a <charDecl> in the 
header of a TEI file.  The TEI, as I'm sure you know, has long 
coped with the fact that there are characters not yet in Unicode 
(sometimes for perfectly acceptable reasons) which people wish to 
encode and document.  In other cases they may wish to track 
particular scribal features or variants (e.g. single-compartment 
vs double-compartment 'a'). The TEI module for doing this 'gaiji' 
is documented at:
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/WD.html For 
some more information about the gaiji bank see: 

Just some additional related information really.


Dr James Cummings, James.Cummings at it.ox.ac.uk
Academic IT Services, University of Oxford

        Date: Wed, 5 Feb 2014 11:25:21 -0600 (CST)
        From: Alan Corre <corre at uwm.edu>
        Subject: Characters not available in Unicode
        In-Reply-To: <1696985272.8265200.1391620589430.JavaMail.root at uwm.edu>

Thanks to Laura for drawing attention to the valuable MUFI initiative. I should like to make some additional remarks relevant to this issue.

1. Unicode is one of the miracles made possible by the digital revolution, and a great achievement. It aims to represent all the myriad symbols used by mankind from ancient times to represent speech, and is still a work in progress. Ligation in many cases is solved in a remarkable fashion. The Tamil syllabary starting at 0B83 hex represents no fewer than 326 symbols or combined symbols. Each is represented by one or two hex numbers, occasionally three or four, and when these are displayed to the reader, they are ligated by the software as if by magic. Truly an excellent solution to a difficult problem.

2. Hebrew is very simple if only its consonantal form is taken into account, consisting of 22 letters + 5 which replace 5 of those letters (usually) at word ending. However, there are two ligatures, alef-lamed and ayin-lamed, which occur frequently in older texts, and I found this troubling. With some help from Alan Wood, I found that there is a hex code for alef-lamed, but there is still a problem. The browsers handle this symbol in different ways, most quite unsatisfactory. Since this is purely a graphic issue, I have decided to represent the ligated forms by the separate forms, and simply note the fact. Unicode does take care of the upside down nun which occurs in the Pentateuch to mark off certain verses, as well as the numerous diacritics used in the Hebrew Bible. The vowels are also routinely inserted in Hebrew poetry.

3. While we are on the subject of unusual graphemes, I point out that the grapheme often erroneously called "Px" is actually a fancy R, originally an instruction from the physician to the pharmacist: "Take!", in Latin "recipe!", followed by the ingredients of his nostrum.

I would also like to make a suggestion, for what it is worth, about the dollar sign $. Wikipedia offers so many hypotheses about its possible origin that one may be sure the matter is still undecided. My suggestion is as follows. "dollar" is derived from the German word "thaler" (the "th" is replaced by "t" in current German), the root of which means "to count" and is cognate with English "tell", the older meaning of which is retained in the word "teller" applied to the individual who counts coins in the bank. The "reichsthaler", which might be rendered "government dollar", was a coin valued at 9 or 10 regular dollars. The storied "Maria Theresa dollar", forever dated 1780, was the standard international silver coin for centuries. Now long words sometimes get beheaded. "horologium" in Latin becomes "reloj" in Spanish, "relógio" in Portuguese. More to the point, 's-Gravenhage is the fancy name for the Dutch city of Den Haag, called in English "The Hague". The full original form was "des Graven Hage" meaning "the garden of the Count." ("des" is the genitive of the definite article.) I suggest that $ comes from the shortened form "*sthaler".
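
Points 1 and 2 above can be checked with a few lines of Python using only the standard unicodedata module. This is a rough sketch, not part of the original message: the Tamil syllable chosen (KA followed by the vowel sign OO) is merely one illustrative example, and the variable names are my own. It shows that a Tamil syllable is stored as separate code points which the rendering software combines, and that the alef-lamed ligature (assumed here to be the code point U+FB4F) reduces to the two separate letters under compatibility normalization, which matches the decision to write the separate forms.

```python
import unicodedata

# One Tamil syllable stored as two code points: the consonant KA (U+0B95)
# followed by the vowel sign OO (U+0BCB). The rendering software, not the
# encoding, is responsible for drawing them as a single combined shape.
tamil_syllable = "\u0b95\u0bcb"
for ch in tamil_syllable:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# The Hebrew alef-lamed ligature has its own code point (assumed: U+FB4F).
ligature = "\ufb4f"
print(f"U+{ord(ligature):04X}  {unicodedata.name(ligature)}")

# Compatibility normalization (NFKC) replaces the ligature with the
# separate letters alef and lamed.
separate = unicodedata.normalize("NFKC", ligature)
for ch in separate:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```

Normalizing in this way is one possible route to storing the separate forms while leaving any ligation to the display software; whether a given browser or font then ligates them is a separate matter.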

Alan D. Corré
Emeritus Professor of Hebrew Studies
University of Wisconsin-Milwaukee
