[Humanist] 26.464 distance measure; amplification making new
Humanist Discussion Group
willard.mccarty at mccarty.org.uk
Thu Nov 8 09:55:15 CET 2012
Humanist Discussion Group, Vol. 26, No. 464.
Department of Digital Humanities, King's College London
www.digitalhumanities.org/humanist
Submit to: humanist at lists.digitalhumanities.org
[1] From: Willard McCarty <willard.mccarty at mccarty.org.uk> (35)
Subject: Re: 26.459 when amplification makes new
[2] From: Neven Jovanovic <filologanoga at gmail.com> (72)
Subject: Re: 26.460 what distance measure?
--[1]------------------------------------------------------------------------
Date: Wed, 07 Nov 2012 07:04:30 +0000
From: Willard McCarty <willard.mccarty at mccarty.org.uk>
Subject: Re: 26.459 when amplification makes new
Thanks to Laval Hunsucker, who wrote in Humanist 26.459 in response to
John Laudun's mention of the extended-mind/body hypothesis regarding
tools and abilities that,
> As far as Classics (my former field), at least, is concerned: this jogs
> my memory that, interestingly, such a line of thought was being
> pursued by Don Fowler not too long before his very untimely death,
> as I heard him discuss in a presentation which he gave at the colloquium
> "Computing in Classical Studies" at the ULondon Institute of Classical
> Studies back in February of 1998. Willard will probably know more
> about this and about whatever further may have come of it. And, in that
> connection, I note his own relevant "A network with a thousand
> entrances: commentary in an electronic age?" ( p.359-402 in _The
> classical commentary: histories, practices, theory_ / ed. by R.K.
> Gibson & C. Shuttleworth Kraus. - Brill, 2002 ) -- an article which he
> in fact dedicated to Fowler. Perhaps he might wish to say something
> more about this.
In my original posting I asked in what sense any of us *has* an ability
whose exercise depends on an external device, in this case, a computer.
The usual argument focuses on moments or periods of time in which the
device is being used, e.g. the person has his or her hands on the tiller
of a boat and so moves through the water as would otherwise be
impossible. But what about when he or she is asleep in bed? Walking the
dog? Where is the ability then? What about if the person should suffer a
horrible accident and lose the arm required for sailing? Lose his or her
sight?
Perhaps I am just playing with words. Is there a real question here?
Yours,
WM
--
Willard McCarty, FRAI / Professor of Humanities Computing & Director of
the Doctoral Programme, Department of Digital Humanities, King's College
London; Professor, School of Computing, Engineering and Mathematics,
University of Western Sydney; Editor, Interdisciplinary Science Reviews
(www.isr-journal.org); Editor, Humanist
(www.digitalhumanities.org/humanist/); www.mccarty.org.uk/
--[2]------------------------------------------------------------------------
Date: Wed, 7 Nov 2012 17:46:51 +0100
From: Neven Jovanovic <filologanoga at gmail.com>
Subject: Re: 26.460 what distance measure?
In-Reply-To: <20121107064419.D725A5FE5 at digitalhumanities.org>
Dear Tom and Humanist,
my math is hopelessly inadequate, but I can offer some linguistic remarks:
1. It is not clear what you're actually counting: tags or differences
("even after I reduced the part-of-speech tags to single alphanumeric
characters to eliminate noise from different-length tags")?
2. If you're counting differences, how do you treat repetition ("fie fie")?
3. The method you've described suggests that "Once more adieu!" and
"Fie, Publius, fie!" are similar -- and to me they do seem similar:
both are incomplete sentences lacking a verb, and both consist of three
elements. Now, if "Fie, Publius, fie!" turns out to be similar to "I
need you" as well -- then, I guess, you have a methodological problem.
(The small sketch after these remarks makes the short-sentence issue
concrete.)
It so happens that I'm currently experimenting with something related
-- trying to compare translations using Levenshtein distance, but not
in a scientific way, more metaphorically. Even so, it seemed necessary
to introduce separate edit distances on different linguistic layers:
lexical, grammatical, semantic. My notes can be found here:
http://www.ffzg.unizg.hr/klafil/dokuwiki/doku.php/z:levenshtein-translation .
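
As a rough sketch of what I mean by separate layers (the token lists are
invented purely for illustration, and the levenshtein() helper from the
sketch above is reused, since it compares list elements just as well as
characters; this is not the code behind the notes linked above):

# Hypothetical tokens for two translations of the same line, on two layers.
lexical_a     = ["once", "more", "adieu"]
lexical_b     = ["farewell", "once", "again"]
grammatical_a = ["av", "av", "uh"]
grammatical_b = ["uh", "av", "av"]

# One edit distance per linguistic layer, kept separate rather than summed.
print("lexical distance:    ", levenshtein(lexical_a, lexical_b))
print("grammatical distance:", levenshtein(grammatical_a, grammatical_b))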
So, in the end, I can only second Tom's request: if there is work that
we should know about, please do enlighten us!
Best,
Neven
Neven Jovanovic
Zagreb, Croatia
On 7 November 2012 07:44, Humanist Discussion Group
<willard.mccarty at mccarty.org.uk> wrote:
> Humanist Discussion Group, Vol. 26, No. 460.
> Department of Digital Humanities, King's College London
> www.digitalhumanities.org/humanist
> Submit to: humanist at lists.digitalhumanities.org
>
>
>
> Date: Tue, 6 Nov 2012 23:06:12 +0000
> From: Tom Salyers <tom.d.salyers at gmail.com>
> Subject: What distance measure should I be using for string similarity?
>
> Here's the executive summary: I'm trying to cluster sentences from
> about twenty Elizabethan plays together based on how similar their
> grammatical structures are. To that end, I've compiled a database of
> the sentences from an XML corpus that has each word tagged with its
> part of speech. For instance, the sentence "Now Faustus, what wouldst
> thou have me do?" has the structure "av np pu q vm p vh p vd pu".
>
> So far, so good. The problem is that since sentences are such
> flexible, modular things, there's no hard-and-fast way to assign a
> sentence to a particular category. What I've finally settled on is
> clustering to assign sentences to categories by their similarity--most
> likely k-medoid clustering, since my original approach, hierarchical
> agglomerative clustering, was hugely time-consuming (on the order of
> O(n^2)).
>
> My problem arises when trying to compute similarities and/or distances
> between the sentences. I originally was trying Levenshtein distance,
> but it seems to be skewing the results for short but
> structurally-different sentences, even after I reduced the
> part-of-speech tags to single alphanumeric characters to eliminate
> noise from different-length tags. For instance, I'm getting "Fie,
> Publius, fie!" (POS tags "uh pu np pu uh pu", encoded as "TPJPTP") put
> in the same cluster as "Once more adieu!" ("av av uh pu", "AATP"),
> which shouldn't really be happening--but the edit distance between the
> two short sentences is so much smaller than their distances to the
> longer sentences that they're getting dropped into the same bucket.
>
> I've started toying around with things like cosine similarity, and to
> that end have reduced my sentences to n-dimensional
> frequency-of-occurrence vectors for each POS tag...but I'm wondering
> if there's a better measure out there that I just haven't heard of.
> Can anyone point me in the right direction? Thanks in advance, and
> please let me know if you need more details.
>
> --
> Tom Salyers
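
For completeness, a minimal sketch of the cosine-similarity route
mentioned in the message above: each sentence becomes a vector of POS-tag
counts, and similarity is the cosine of the angle between those vectors.
Only the example sentences quoted above are used; the function and tag
handling are illustrative assumptions, not Tom's actual pipeline.

import math
from collections import Counter

def cosine_similarity(tags_a, tags_b):
    # Compare two sentences by the angle between their POS-tag count vectors.
    ca, cb = Counter(tags_a), Counter(tags_b)
    dot = sum(ca[t] * cb[t] for t in set(ca) | set(cb))
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

faustus = "av np pu q vm p vh p vd pu".split()  # "Now Faustus, what wouldst thou have me do?"
fie     = "uh pu np pu uh pu".split()           # "Fie, Publius, fie!"
adieu   = "av av uh pu".split()                 # "Once more adieu!"

print("fie vs adieu:  ", round(cosine_similarity(fie, adieu), 3))
print("fie vs faustus:", round(cosine_similarity(fie, faustus), 3))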