[Humanist] 26.467 distance measure

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Fri Nov 9 07:20:56 CET 2012


                 Humanist Discussion Group, Vol. 26, No. 467.
            Department of Digital Humanities, King's College London
                       www.digitalhumanities.org/humanist
                Submit to: humanist at lists.digitalhumanities.org

  [1]   From:    Bryan Jurish <moocow.bovine at gmail.com>                   (116)
        Subject: Re:   26.460 what distance measure?

  [2]   From:    Desmond Schmidt <desmond.allan.schmidt at gmail.com>         (67)
        Subject: Re:  26.460 what distance measure?


--[1]------------------------------------------------------------------------
        Date: Thu, 08 Nov 2012 14:47:15 +0100
        From: Bryan Jurish <moocow.bovine at gmail.com>
        Subject: Re:   26.460 what distance measure?
        In-Reply-To: <509A3449.8040301 at bbaw.de>

morning Tom, morning list,

This is my first post to this list, so I suppose a brief introduction is
in order: I'm a computational linguist originally trained as a
philosopher and a confirmed rationalist with a penchant for empirical
data.  A colleague of mine kindly forwarded me Tom's posting to the
list, and since I'm interested in this kind of thing as well, it piqued
my interest...

On 2012-11-06 23:06:12, Tom Salyers appears to have written:
> Here's the executive summary: I'm trying to cluster sentences from
> about twenty Elizabethan plays together based on how similar their
> grammatical structures are. 

Judging from the rest of the posting, I'm guessing you're interested
primarily in syntactic structure and not the semantics, correct?

> To that end, I've compiled a database of
> the sentences from an XML corpus that has each word tagged with its
> part of speech. For instance, the sentence "Now Faustus, what wouldst
> thou have me do?" has the structure "av np pu q vm p vh p vd pu".

Idle speculation: when everything else is working, it might be useful to
run a parser over the sentences and work with tree-postorder strings;
e.g. for "S(NP(NE(John)),VP(V(loves),NP(NE(Mary))))" you'd get
"NE,NP,V,NE,NP,VP,S" rather than just the less-informative "NE,V,NE":
that way you might be able to get a handle on constituent structure
rather than just POS-substrings, but I digress...
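To make that concrete, here's a tiny Python sketch of the postorder idea; the nested-tuple tree encoding and the toy "John loves Mary" tree are mine, just for illustration, not anything from your corpus or a particular parser:

```python
# Represent a parse tree as (label, children) tuples and emit the
# node labels in postorder, leaves first, root last.

def postorder(tree):
    """Yield the labels of a (label, children) tree in postorder."""
    label, children = tree
    for child in children:
        yield from postorder(child)
    yield label

# Hypothetical parse of "John loves Mary" (lexical items omitted).
tree = ("S", [("NP", [("NE", [])]),
              ("VP", [("V", []), ("NP", [("NE", [])])])])

print(",".join(postorder(tree)))  # NE,NP,V,NE,NP,VP,S
```

Any parser output that can be coerced into that nested shape would do; the resulting postorder strings can then be fed to whatever string distance you settle on.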

> So far, so good. The problem is that since sentences are such
> flexible, modular things, there's no hard-and-fast way to assign a
> sentence into a particular category. What I've finally settled on is
> clustering to assign sentences to categories by their similarity--most
> likely k-medoid clustering, since my original approach, hierarchical
> agglomerative clustering, was hugely time-consuming. (On the order of
> O of n^2.)

Yup.  You might look at [1] for an alternative to "pure" hierarchical
clustering which manages to sneak around the quadratic complexity by
using a 2-pass approach: hierarchically cluster a small number
(sqrt(kn)) of (randomly chosen) prototypes and subsequently attach the
remaining items to the prototype-clusters.
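In case it helps, here's a rough Python sketch of that two-pass trick; the single-link merging and the function names are my own simplification of the Scatter/Gather idea, not the paper's actual algorithm:

```python
import math
import random

def two_pass_cluster(items, k, dist):
    """Two-pass clustering sketch: hierarchically cluster ~sqrt(k*n)
    random prototypes down to k groups (quadratic only in the small
    prototype set), then attach every item to its nearest group."""
    n = len(items)
    m = max(k, math.isqrt(k * n))
    protos = random.sample(items, min(m, n))

    # Pass 1: naive single-link agglomerative clustering of prototypes.
    clusters = [[p] for p in protos]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)

    # Pass 2: attach all items to the nearest prototype cluster.
    out = [[] for _ in clusters]
    for x in items:
        nearest = min(range(len(clusters)),
                      key=lambda c: min(dist(x, p) for p in clusters[c]))
        out[nearest].append(x)
    return out
```

With `dist` set to your sentence distance and `items` your POS strings, the expensive pairwise work only ever touches the sqrt(kn) prototypes.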

> My problem arises when trying to compute similarities and/or distances
> between the sentences. I originally was trying Levenshtein distance,
> but it seems to be skewing the results for short but
> structurally-different sentences, even after I reduced the
> part-of-speech tags to single alphanumeric characters to eliminate
> noise from different-length tags. For instance, I'm getting "Fie,
> Publius, fie!" (POS tags "uh pu np pu uh pu", encoded as "TPJPTP") put
> in the same cluster as "Once more adieu!" ("av av uh pu", "AATP"),
> which shouldn't really be happening--but the edit distance is so much
> smaller between them and the longer sentences that they're getting
> dropped into the same bucket.

This is a well-known problem with the Levenshtein metric: the maximum
distance depends on the lengths of the argument strings.  In theory, you
can adjust for this by scaling each comparison by an "expected distance"
a la [2], setting

  d'(a,b) = (d(a,b) - E(d|a,b)) / (max(d|a,b) - E(d|a,b))

... unfortunately, afaik, no closed-form solution is currently known for
the Levenshtein distance between two random (read "incompressible")
strings of given length (but you can get a pretty good empirical
estimate for a small alphabet size like yours).  Add to that the fact
that POS-strings for natural language sentences are by no means random
(otherwise statistical taggers would be utterly useless), and the
Levenshtein distance starts looking less attractive.
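Since an empirical estimate is the practical route here, a quick Python sketch of what I mean (the Monte-Carlo estimator and its trial count are my own crude choices, not anything from [2]):

```python
import random

def lev(a, b):
    """Standard dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def expected_lev(m, n, alphabet, trials=500, rng=None):
    """Monte-Carlo estimate of E(d) between random strings of
    lengths m and n over the given alphabet."""
    rng = rng or random.Random(0)
    total = 0
    for _ in range(trials):
        a = "".join(rng.choice(alphabet) for _ in range(m))
        b = "".join(rng.choice(alphabet) for _ in range(n))
        total += lev(a, b)
    return total / trials

def adjusted_lev(a, b, alphabet):
    """(d - E(d)) / (max - E(d)): negative means closer than chance,
    0 roughly 'as far apart as random strings', 1 maximally far."""
    d = lev(a, b)
    e = expected_lev(len(a), len(b), alphabet)
    mx = max(len(a), len(b))
    return 0.0 if mx == e else (d - e) / (mx - e)
```

The adjustment at least stops short sentence pairs from looking spuriously close merely because their maximum possible distance is small.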

> I've started toying around with things like cosine similarity, and to
> that end have reduced my sentences to n-dimensional
> frequency-of-occurrence vectors for each POS tag...but I'm wondering
> if there's a better measure out there that I just haven't heard of.
> Can anyone point me in the right direction? Thanks in advance, and
> please let me know if you need more details.

I suspect that cosine similarity over POS-unigram frequencies won't give
very satisfactory results if you're interested in syntax as such.  For
one thing, POS frequencies (like most other natural language frequency
distributions) aren't usually amenable to linear approximations (which
is what cosine-distance is really doing under the hood).  For
medium-to-large samples, log-frequencies *do* tend to work well [my
hunch: probably because of their relation to model-entropy, since log(f)
~ -log(p(f))], but a single sentence as a sample is likely too small for
that to be of much use.
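For what it's worth, here is the kind of thing I mean by cosine over POS-unigram vectors, with the (1 + log f) damping as an option; the helper names are mine and the two example tag strings are just your "Fie, Publius, fie!" / "Once more adieu!" pair:

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two sparse count vectors (dicts)."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def pos_unigrams(tag_string, log_scale=False):
    """POS-unigram frequency vector; optionally 1 + log(f) damping
    (likely only worthwhile on samples larger than one sentence)."""
    counts = Counter(tag_string.split())
    if log_scale:
        return {t: 1.0 + math.log(f) for t, f in counts.items()}
    return dict(counts)

a = pos_unigrams("uh pu np pu uh pu")  # "Fie, Publius, fie!"
b = pos_unigrams("av av uh pu")        # "Once more adieu!"
print(round(cosine(a, b), 3))
```

Note that the pair you mention still comes out fairly similar under raw unigram cosine, which rather illustrates my worry: unigram counts throw away exactly the ordering that makes these two sentences syntactically different.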

There is a very great deal of work out there on string similarity
measures, usually in the context of spelling correction or "fuzzy"
search; [3] gives a pretty good survey.  Since I'm assuming you want to
cluster based on syntactic properties, I think your easiest path forward
would be to go with an n-gram distance measure like [4,5], possibly
adjusting it for expected and/or maximum values, or smoothing by
interpolation (in order to avoid zeroes due to sparse data).
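As a starting point, a minimal Python sketch of a q-gram distance in the spirit of [4] (the padding scheme and q=3 default are my choices; real implementations vary):

```python
from collections import Counter

def qgrams(s, q=3, pad="#"):
    """Multiset (profile) of q-grams of s, padded at both ends so
    that even strings shorter than q yield some grams."""
    s = pad * (q - 1) + s + pad * (q - 1)
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def qgram_distance(a, b, q=3):
    """Ukkonen-style q-gram distance: L1 distance between the two
    q-gram profiles; 0 iff the profiles coincide."""
    pa, pb = qgrams(a, q), qgrams(b, q)
    return sum(abs(pa[g] - pb[g]) for g in set(pa) | set(pb))
```

Run over your single-character POS encodings this is cheap, symmetric, and, unlike plain Levenshtein, its scale can be normalised straightforwardly by the total number of grams in both strings.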

If you're feeling ambitious, you might also look into information-based
distance measures like those in [6,7] (but note that these often don't
play nicely together with "standard" clustering algorithms) -- in
theory, these should be able to compensate for common similarities (e.g.
sentential properties like has-a-subject, has-an-inflected-verb) while
still punishing items which lack those properties.

Apologies for the long-winded reply; I hope it's at least mildly useful.
Also, your project sounds very interesting -- please let me know how it
turns out, and if you have a more detailed description than your
"executive summary", I'd be interested in that too!

marmosets,
	Bryan

REFERENCES

[1] D. Cutting, J. O. Pedersen, D. Karger, and J. W. Tukey.
Scatter/gather: A cluster-based approach to browsing large document
collections. In Proceedings of the Fifteenth Annual International ACM
SIGIR Conference on Research and Development in Information Retrieval,
pp. 318-329, 1992.

[2] L. Hubert and P. Arabie. Comparing partitions. Journal of
Classification, 2:193-218, 1985.

[3] G. Navarro. A guided tour to approximate string matching. ACM
Computing Surveys, 33(1):31-88, 2001.

[4] E. Ukkonen. Approximate string-matching with q-grams and maximal
matches. Theoretical Computer Science, 92:191-211, 1992.

[5] G. Kondrak. N-gram similarity and distance. In Proc. Twelfth Int'l
Conf. on String Processing and Information Retrieval, pp. 115-126, 2005.

[6] P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai, and R. L.
Mercer. Class-based n-gram models of natural language. Computational
Linguistics, 18(4):467-479, 1992.

[7] P. M. B. Vitányi, F. J. Balbach, R. L. Cilibrasi, and M. Li.
Normalized Information Distance. In F. Emmert-Streib and M. Dehmer
(eds), Information Theory and Statistical Learning, Springer, pp. 45-82,
2009.

-- 
Bryan Jurish                           "There is *always* one more bug."
moocow.bovine at gmail.com         -Lubarsky's Law of Cybernetic Entomology



--[2]------------------------------------------------------------------------
        Date: Fri, 9 Nov 2012 11:47:52 +1000
        From: Desmond Schmidt <desmond.allan.schmidt at gmail.com>
        Subject: Re:  26.460 what distance measure?
        In-Reply-To: <20121107064419.D725A5FE5 at digitalhumanities.org>

Hi Tom,

Since you have reduced your patterns to strings, why not use an n-gram
approach to similarity, like a BLAST search?
Adapting it to your problem you might try comparing two patterns by
the number of 3-grams they share, normalised by the number of such
trigrams that they *could* share. So EABCD would share one trigram
with ABC out of a possible 3 (score==.333) whereas it would share
nothing with AEBDC (score 0), which has the same Levenshtein distance
from ABC. This preserves patterns better than a crude edit distance.
This is just a suggestion. I haven't tried it and it may not work.
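In Python the idea might look something like this (a rough sketch only, using the same toy strings as above):

```python
def trigrams(s):
    """Set of character trigrams of s."""
    return {s[i:i + 3] for i in range(len(s) - 2)}

def trigram_score(a, b):
    """Shared trigrams, normalised by the number the pair could
    possibly share (the larger of the two trigram inventories)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / max(len(ta), len(tb))

print(trigram_score("EABCD", "ABC"))  # shares only "ABC": 1/3
print(trigram_score("AEBDC", "ABC"))  # shares nothing: 0.0
```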

On Wed, Nov 7, 2012 at 4:44 PM, Humanist Discussion Group
<willard.mccarty at mccarty.org.uk> wrote:
>                  Humanist Discussion Group, Vol. 26, No. 460.
>             Department of Digital Humanities, King's College London
>                        www.digitalhumanities.org/humanist
>                 Submit to: humanist at lists.digitalhumanities.org
>
>
>
>         Date: Tue, 6 Nov 2012 23:06:12 +0000
>         From: Tom Salyers <tom.d.salyers at gmail.com>
>         Subject: What distance measure should I be using for string similarity?
>
> Here's the executive summary: I'm trying to cluster sentences from
> about twenty Elizabethan plays together based on how similar their
> grammatical structures are. To that end, I've compiled a database of
> the sentences from an XML corpus that has each word tagged with its
> part of speech. For instance, the sentence "Now Faustus, what wouldst
> thou have me do?" has the structure "av np pu q vm p vh p vd pu".
>
> So far, so good. The problem is that since sentences are such
> flexible, modular things, there's no hard-and-fast way to assign a
> sentence into a particular category. What I've finally settled on is
> clustering to assign sentences to categories by their similarity--most
> likely k-medoid clustering, since my original approach, hierarchical
> agglomerative clustering, was hugely time-consuming. (On the order of
> O of n^2.)
>
> My problem arises when trying to compute similarities and/or distances
> between the sentences. I originally was trying Levenshtein distance,
> but it seems to be skewing the results for short but
> structurally-different sentences, even after I reduced the
> part-of-speech tags to single alphanumeric characters to eliminate
> noise from different-length tags. For instance, I'm getting "Fie,
> Publius, fie!" (POS tags "uh pu np pu uh pu", encoded as "TPJPTP") put
> in the same cluster as "Once more adieu!" ("av av uh pu", "AATP"),
> which shouldn't really be happening--but the edit distance is so much
> smaller between them and the longer sentences that they're getting
> dropped into the same bucket.
>
> I've started toying around with things like cosine similarity, and to
> that end have reduced my sentences to n-dimensional
> frequency-of-occurrence vectors for each POS tag...but I'm wondering
> if there's a better measure out there that I just haven't heard of.
> Can anyone point me in the right direction? Thanks in advance, and
> please let me know if you need more details.
>
> --
> Tom Salyers



More information about the Humanist mailing list