[Humanist] 27.780 characters not in UniCode

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Mon Feb 10 07:00:26 CET 2014


                 Humanist Discussion Group, Vol. 27, No. 780.
            Department of Digital Humanities, King's College London
                       www.digitalhumanities.org/humanist
                Submit to: humanist at lists.digitalhumanities.org



        Date: Sun, 9 Feb 2014 19:43:43 +1000
        From: Desmond Schmidt <desmond.allan.schmidt at gmail.com>
        Subject: Re:  27.777 characters not in UniCode
        In-Reply-To: <20140209070730.E81CC622E at digitalhumanities.org>


Hi Maurizio,

thank you very much for this information. It is most helpful, and thank you
also to James, Alan and Laura for their references. I now have a wealth of
information with which to answer a criticism of an argument of mine, which
I will explain briefly since Maurizio asks.

It is simply that I was trying to make the case that the interpretation
involved in transcribing a character differs in kind from the
interpretation involved in applying markup. If you see an encodable
Unicode character you can only choose to encode it or not, so the
interpretation is binary. You might not be interested in that text: for
example, do you encode the running header on a printed page? But no (sane)
person would dispute that a clear "T" is a "T", even though in principle
they could. So pragmatically this is not really an interpretation, or at
least it differs from recording a simple textual feature like italics,
which does need markup, for there I have a choice of a dozen codes in a
dozen markup languages to represent it. A character these days is mostly
Unicode, and if not it can usually be translated one for one into Unicode.
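To make the point concrete, here is a minimal sketch in ordinary Python (standard library only, not part of any edition software) of the two claims above: a clear glyph maps to exactly one code point, and a legacy single-byte encoding translates one for one into Unicode. The final line also illustrates the homoglyph problem Maurizio describes below for the sescuncia:

```python
import unicodedata

# A clear "T" maps to exactly one Unicode code point: the transcriber's
# decision is binary (encode it or not), not a choice among representations.
assert unicodedata.name("T") == "LATIN CAPITAL LETTER T"

# Characters from a legacy single-byte encoding translate one for one
# into Unicode via the codec's mapping table.
legacy = bytes([0xE9])                 # 'é' in Latin-1
assert legacy.decode("latin-1") == "\u00E9"

# But the mapping is by code point, not by meaning: a sescuncia
# transcribed with the pound glyph becomes POUND SIGN, and the semantic
# distinction is lost.
print(unicodedata.name("\u00A3"))      # → POUND SIGN
```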

The counter-argument went like this: there are some characters not in
Unicode that require markup to represent them. In TEI you would use <char>
to define such a character and <g> to use it. But that is markup, and so
here is a clear and unambiguous character that nevertheless needs
interpretative markup to represent it.
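The TEI mechanism just mentioned works roughly as follows under the gaiji module of TEI P5 (as it stood at the time of writing); the xml:id, character name and mapping here are invented for illustration, using Maurizio's sescuncia example from below:

```xml
<!-- In the header: declare the non-Unicode character (hypothetical id) -->
<charDecl>
  <char xml:id="sescuncia">
    <charName>ROMAN SESCUNCIA SIGN</charName>
    <mapping type="standard">£</mapping>
  </char>
</charDecl>

<!-- In the transcription: point at the declaration wherever it occurs -->
<p>... <g ref="#sescuncia">£</g> ...</p>
```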

But this situation is no different from seeing a maths formula, for which
you need MathML or TeX, or an inline graphic. So all those things can be
lumped together as "markup", and you are left with the original argument:
if the character is in Unicode and the glyph is clear and unambiguous,
then it is not really interpretative. And that covers most of the text
that you transcribe. So the argument that you can't separate markup from
text because it is all interpretative is weak. At least on technological
grounds you reduce the amount of interpretation in the transcription by an
order of magnitude when you remove the markup, including the non-Unicode
characters and the maths (why not?). What remains contains some kinds of
interpretation, like spaces, tabs and carriage returns, which are
admittedly formats, but it is trivial compared with the interpretation
involved in deciding how to record other textual features that cannot be
represented as Unicode "text".

On Sun, Feb 9, 2014 at 5:07 PM, Humanist Discussion Group <
willard.mccarty at mccarty.org.uk> wrote:

>                  Humanist Discussion Group, Vol. 27, No. 777.
>             Department of Digital Humanities, King's College London
>                        www.digitalhumanities.org/humanist
>                 Submit to: humanist at lists.digitalhumanities.org
>
>
>
>         Date: Sat, 08 Feb 2014 10:00:42 +0100
>         From: maurizio lana <maurizio.lana at gmail.com>
>         Subject: Re:  27.760 characters not in UniCode?
>
>
> at digilibLT - the digital library of late Latin texts - we also deal
> with scientific texts. many characters, mainly but not only the units of
> measure in those texts, are missing from Unicode.
> after a rather quick survey with the help of david paniagua
> (universidad de salamanca), simona musso and valentina rinaldi (both of
> università del piemonte orientale), who work for the digilibLT library,
> we can list these groups of characters:
>
>   * roman numerals with multiplier mark
>   * greek numerals with multiplier mark: see a list with images of
>     missing characters at
>
> https://drive.google.com/file/d/0B1SZjoqdPETSaTlnd0ROOHJiUXM/edit?usp=sharing
>   * units of measure: for a list, see the PDF doc at
>
> https://drive.google.com/file/d/0B1SZjoqdPETSX2h6MEVtN1ZVMWM/edit?usp=sharing
>     where many characters are listed which have no Unicode code point
>     representing them (see all the characters described with "null" or
>     with more than 2 Unicode codes); other characters in the document
>     show a Unicode code point and a glyph, but that pair really refers
>     to another 'entity' which happens to have the same glyph: this is
>     the case, for example, of the sescuncia, which happens to have the
>     same glyph as the british currency "pound", so when your text has a
>     sescuncia you put the glyph of the british pound into the digital
>     'rendering' of the text. this should be avoided, but to avoid it
>     you need a specific Unicode character for the sescuncia, even if
>     its glyph is identical to an already existing one
>   * ligatures: for a list of ligatures for units of measure see
>
> https://drive.google.com/file/d/0B1SZjoqdPETSSzhjbXM5QUFrekk/edit?usp=sharing
>     they can obviously be replaced by their disconnected elements, but
>     if we want to produce a diplomatic edition it is not the same thing
>     to reproduce, and offer the reader, the ligature which, because of
>     its peculiar stroke, could have led to a certain error, or to read
>     the single elements, whose strokes cannot be mis-read or
>     misinterpreted. so probably we also need specific characters for
>     ligatures.
>
> best
> maurizio
>
> PS: desmond, why are you making this catalogue of missing Unicode
> characters? can we hope for an initiative towards the Unicode
> consortium to enrich the encoding with these definitions?
> :-))
>
> --
> The knowledge gap between rich and poor is widening.
> I. H. Witten, D. Bainbridge, D. M. Nichols,
> How to build a digital library, p. 26
> -------
> il corso di informatica umanistica:
> http://www.youtube.com/watch?v=85JsyJw2zuw
> la biblioteca digitale del latino tardo: http://www.digiliblt.unipmn.it/
> a day in the life of DH2013: http://dayofdh2013.matrix.msu.edu/digiliblt/
> che cosa sono le digital humanities:
> http://www.youtube.com/watch?v=4JqLst_VKCA
> -------
> Maurizio Lana - ricercatore
> Università del Piemonte Orientale, Dipartimento di Studi Umanistici
> via Manzoni 8, 13100 Vercelli - tel. +39 347 7370925



