[Humanist] 26.464 distance measure; amplification making new

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Thu Nov 8 09:55:15 CET 2012


                 Humanist Discussion Group, Vol. 26, No. 464.
            Department of Digital Humanities, King's College London
                       www.digitalhumanities.org/humanist
                Submit to: humanist at lists.digitalhumanities.org

  [1]   From:    Willard McCarty <willard.mccarty at mccarty.org.uk>          (35)
        Subject: Re:  26.459 when amplification makes new

  [2]   From:    Neven Jovanovic <filologanoga at gmail.com>                  (72)
        Subject: Re:  26.460 what distance measure?


--[1]------------------------------------------------------------------------
        Date: Wed, 07 Nov 2012 07:04:30 +0000
        From: Willard McCarty <willard.mccarty at mccarty.org.uk>
        Subject: Re:  26.459 when amplification makes new


Thanks to Laval Hunsucker, who wrote in Humanist 26.459, in response to 
John Laudun's mention of the extended-mind/body hypothesis regarding 
tools and abilities, the following:

> As far as Classics (my former field), at least, is concerned: this jogs
> my memory that, interestingly, such a line of thought was being
> pursued by Don Fowler not too long before his very untimely death,
> as I heard him discuss in a presentation which he gave at the colloquium
> "Computing in Classical Studies" at the ULondon Institute of Classical
> Studies back in February of 1998.  Willard will probably know more
> about this and about whatever further may have come of it. And, in that
> connection, I note his own relevant "A network with a thousand
> entrances: commentary in an electronic age?" (pp. 359-402 in _The
> classical commentary: histories, practices, theory_, ed. by R.K.
> Gibson & C. Shuttleworth Kraus, Brill, 2002) -- an article which he
> in fact dedicated to Fowler. Perhaps he might wish to say something
> more about this.

In my original posting I asked in what sense any of us *has* an ability 
whose exercise depends on an external device, in this case, a computer. 
The usual argument focuses on moments or periods of time in which the 
device is being used, e.g. the person has his or her hands on the tiller 
of a boat and so moves through the water as would otherwise be 
impossible. But what about when he or she is asleep in bed? Walking the 
dog? Where is the ability then? What if the person should suffer a 
horrible accident and lose the arm required for sailing, or lose his or 
her sight?

Perhaps I am just playing with words. Is there a real question here?

Yours,
WM

-- 
Willard McCarty, FRAI / Professor of Humanities Computing & Director of
the Doctoral Programme, Department of Digital Humanities, King's College
London; Professor, School of Computing, Engineering and Mathematics,
University of Western Sydney; Editor, Interdisciplinary Science Reviews
(www.isr-journal.org); Editor, Humanist
(www.digitalhumanities.org/humanist/); www.mccarty.org.uk/


--[2]------------------------------------------------------------------------
        Date: Wed, 7 Nov 2012 17:46:51 +0100
        From: Neven Jovanovic <filologanoga at gmail.com>
        Subject: Re:  26.460 what distance measure?
        In-Reply-To: <20121107064419.D725A5FE5 at digitalhumanities.org>

Dear Tom and Humanist,

My math is hopelessly inadequate, but I can offer some linguistic remarks:

1. It is not clear what you're actually counting: tags or differences
("even after I reduced the part-of-speech tags to single alphanumeric
characters to eliminate noise from different-length tags")?

2. If you're counting differences, how do you treat repetition ("fie fie")?

3. The method you've described suggests that "Once more adieu!" and
"Fie, Publius, fie!" are similar -- and to me they do seem similar:
both are incomplete sentences lacking a verb, and both consist of three
elements. Now, if "Fie, Publius, fie!" turns out to be similar to "I
need you" as well, then, I guess, you have a methodological problem.

It so happens that I'm currently experimenting with something related
-- trying to compare translations using Levenshtein distance, though not
in a scientific way, more metaphorically. Even so, it seemed necessary
to introduce separate edit distances on different linguistic layers:
lexical, grammatical, and semantic. My notes can be found here:
 http://www.ffzg.unizg.hr/klafil/dokuwiki/doku.php/z:levenshtein-translation .
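
For what it's worth, a minimal sketch of that layered idea (not the
method in the notes above, just an illustration): each layer gets its
own normalised distance, and the layers are then combined with weights.
The example sentences, tag sets, and weights here are invented, and
difflib's SequenceMatcher is only a stand-in for a proper per-layer edit
distance:

    from difflib import SequenceMatcher   # stand-in for a per-layer edit distance

    def layer_distance(a, b):
        """Distance between two token sequences, normalised to the range 0..1."""
        return 1.0 - SequenceMatcher(None, a, b).ratio()

    # Hypothetical layered representations of one sentence in two translations.
    translation_1 = {
        "lexical":     ["arms", "and", "the", "man", "I", "sing"],
        "grammatical": ["n", "cc", "dt", "n", "pp", "v"],
    }
    translation_2 = {
        "lexical":     ["I", "sing", "of", "arms", "and", "of", "a", "man"],
        "grammatical": ["pp", "v", "in", "n", "cc", "in", "dt", "n"],
    }

    # One distance per layer, combined with (arbitrary) weights; a semantic
    # layer would need its own representation and would slot in the same way.
    weights = {"lexical": 0.5, "grammatical": 0.5}
    combined = sum(w * layer_distance(translation_1[layer], translation_2[layer])
                   for layer, w in weights.items())
    print(combined)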

So, in the end, I can only second Tom's request: if there is work that
we should know about, please do enlighten us!

Best,

Neven

Neven Jovanovic
Zagreb, Croatia

On 7 November 2012 07:44, Humanist Discussion Group
<willard.mccarty at mccarty.org.uk> wrote:
>                  Humanist Discussion Group, Vol. 26, No. 460.
>             Department of Digital Humanities, King's College London
>                        www.digitalhumanities.org/humanist
>                 Submit to: humanist at lists.digitalhumanities.org
>
>
>
>         Date: Tue, 6 Nov 2012 23:06:12 +0000
>         From: Tom Salyers <tom.d.salyers at gmail.com>
>         Subject: What distance measure should I be using for string similarity?
>
> Here's the executive summary: I'm trying to cluster sentences from
> about twenty Elizabethan plays together based on how similar their
> grammatical structures are. To that end, I've compiled a database of
> the sentences from an XML corpus that has each word tagged with its
> part of speech. For instance, the sentence "Now Faustus, what wouldst
> thou have me do?" has the structure "av np pu q vm p vh p vd pu".
>
> So far, so good. The problem is that since sentences are such
> flexible, modular things, there's no hard-and-fast way to assign a
> sentence into a particular category. What I've finally settled on is
> clustering to assign sentences to categories by their similarity--most
> likely k-medoid clustering, since my original approach, hierarchical
> agglomerative clustering, was hugely time-consuming (on the order of
> O(n^2)).
>
> My problem arises when trying to compute similarities and/or distances
> between the sentences. I originally was trying Levenshtein distance,
> but it seems to be skewing the results for short but
> structurally-different sentences, even after I reduced the
> part-of-speech tags to single alphanumeric characters to eliminate
> noise from different-length tags. For instance, I'm getting "Fie,
> Publius, fie!" (POS tags "uh pu np pu uh pu", encoded as "TPJPTP") put
> in the same cluster as "Once more adieu!" ("av av uh pu", "AATP"),
> which shouldn't really be happening--but the edit distance between them
> is so much smaller than the distances involving longer sentences that
> they're getting dropped into the same bucket.
>
> I've started toying around with things like cosine similarity, and to
> that end have reduced my sentences to n-dimensional
> frequency-of-occurrence vectors for each POS tag...but I'm wondering
> if there's a better measure out there that I just haven't heard of.
> Can anyone point me in the right direction? Thanks in advance, and
> please let me know if you need more details.
>
> --
> Tom Salyers
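
As a footnote to the cosine-similarity idea in the quoted message above,
a minimal sketch: the sentences and POS tags are the ones Tom quotes;
everything else, including the choice of POS bigrams rather than plain
per-tag frequencies, is just one possible refinement (plain frequency
vectors discard word order entirely):

    from collections import Counter
    from math import sqrt

    def pos_bigram_vector(tags):
        """Counts of adjacent POS-tag pairs; bigrams keep some of the word-order
        information that a plain per-tag frequency vector throws away."""
        return Counter(zip(tags, tags[1:]))

    def cosine(u, v):
        """Cosine similarity between two sparse count vectors."""
        dot = sum(u[k] * v[k] for k in u)
        norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
        return dot / norm if norm else 0.0

    # POS-tag sequences quoted in Tom's message.
    fie     = "uh pu np pu uh pu".split()            # "Fie, Publius, fie!"
    adieu   = "av av uh pu".split()                  # "Once more adieu!"
    faustus = "av np pu q vm p vh p vd pu".split()   # "Now Faustus, what wouldst thou have me do?"

    for label, tags in [("adieu", adieu), ("faustus", faustus)]:
        print(label, round(cosine(pos_bigram_vector(fie), pos_bigram_vector(tags)), 3))

The resulting pairwise similarities (or one minus them, as distances)
can be precomputed into a matrix and handed to a k-medoids routine,
which needs only distances, not coordinates.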



