[Humanist] 26.460 what distance measure?

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Wed Nov 7 07:44:19 CET 2012

                 Humanist Discussion Group, Vol. 26, No. 460.
            Department of Digital Humanities, King's College London
                Submit to: humanist at lists.digitalhumanities.org

        Date: Tue, 6 Nov 2012 23:06:12 +0000
        From: Tom Salyers <tom.d.salyers at gmail.com>
        Subject: What distance measure should I be using for string similarity?

Here's the executive summary: I'm trying to cluster sentences from
about twenty Elizabethan plays together based on how similar their
grammatical structures are. To that end, I've compiled a database of
the sentences from an XML corpus that has each word tagged with its
part of speech. For instance, the sentence "Now Faustus, what wouldst
thou have me do?" has the structure "av np pu q vm p vh p vd pu".

So far, so good. The problem is that since sentences are such
flexible, modular things, there's no hard-and-fast way to assign a
sentence into a particular category. What I've finally settled on is
clustering to assign sentences to categories by their similarity--most
likely k-medoid clustering, since my original approach, hierarchical
agglomerative clustering, was hugely time-consuming. (On the order of
O of n^2.)

My problem arises when trying to compute similarities and/or distances
between the sentences. I originally was trying Levenshtein distance,
but it seems to be skewing the results for short but
structurally-different sentences, even after I reduced the
part-of-speech tags to single alphanumeric characters to eliminate
noise from different-length tags. For instance, I'm getting "Fie,
Publius, fie!" (POS tags "uh pu np pu uh pu", encoded as "TPJPTP") put
in the same cluster as "Once more adieu!" ("av av uh pu", "AATP"),
which shouldn't really be happening--but the edit distance is so much
smaller between them and the longer sentences that they're getting
dropped into the same bucket.

I've started toying around with things like cosine similarity, and to
that end have reduced my sentences to n-dimensional
frequency-of-occurrence vectors for each POS tag...but I'm wondering
if there's a better measure out there that I just haven't heard of.
Can anyone point me in the right direction? Thanks in advance, and
please let me know if you need more details.

Tom Salyers

More information about the Humanist mailing list