Calculating WordNet Synset similarity
Synsets are organized in a hypernym tree. This tree can be used for reasoning about the similarity between the Synsets it contains. The closer the two Synsets are in the tree, the more similar they are.
How to do it...
If you were to look at all the hyponyms of reference_book (which is the hypernym of cookbook), you'd see that one of them is instruction_book. This seems intuitively very similar to a cookbook, so let's see what WordNet similarity has to say about it with the help of the following code:
>>> from nltk.corpus import wordnet
>>> cb = wordnet.synset('cookbook.n.01')
>>> ib = wordnet.synset('instruction_book.n.01')
>>> cb.wup_similarity(ib)
0.9166666666666666
So they are over 91% similar!
How it works...
The wup_similarity method is short for Wu-Palmer Similarity, which is a scoring method based on how similar the word senses are and where the Synsets occur relative to each other in the hypernym tree. One of the core metrics used to calculate similarity is the shortest path distance between the two Synsets and their common hypernym:
>>> ref = cb.hypernyms()[0]
>>> cb.shortest_path_distance(ref)
1
>>> ib.shortest_path_distance(ref)
1
>>> cb.shortest_path_distance(ib)
2
So cookbook and instruction_book must be very similar, because they are only one step away from the same reference_book hypernym, and, therefore, only two steps away from each other.
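The scoring idea above can be sketched without WordNet at all. The following is a minimal, hand-built toy tree, not NLTK's implementation. It assumes the commonly cited Wu-Palmer formula, 2 * depth(lcs) / (depth(a) + depth(b)), where lcs is the deepest shared hypernym and the root is counted at depth 1. The PARENT mapping and the chain of nodes above reference_book are illustrative assumptions, chosen so that reference_book sits at depth 11.

```python
# Toy hypernym tree: child -> parent. Hand-built for illustration only;
# it is NOT loaded from WordNet, and the chain above reference_book is assumed.
PARENT = {
    "physical_entity": "entity",
    "object": "physical_entity",
    "whole": "object",
    "artifact": "whole",
    "creation": "artifact",
    "product": "creation",
    "work": "product",
    "publication": "work",
    "book": "publication",
    "reference_book": "book",
    "cookbook": "reference_book",
    "instruction_book": "reference_book",
}


def path_to_root(node):
    """Return [node, parent, ..., root] by following PARENT links."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(node):
    """Depth of a node, counting the root as depth 1."""
    return len(path_to_root(node))


def lowest_common_hypernym(a, b):
    """First ancestor of b (walking upward) that is also an ancestor of a."""
    ancestors_of_a = set(path_to_root(a))
    for candidate in path_to_root(b):
        if candidate in ancestors_of_a:
            return candidate
    return None


def shortest_path_distance(a, b):
    """Number of edges from a up to the shared hypernym and back down to b."""
    lcs = lowest_common_hypernym(a, b)
    return (depth(a) - depth(lcs)) + (depth(b) - depth(lcs))


def wup(a, b):
    """Wu-Palmer style score under the assumptions stated above."""
    lcs = lowest_common_hypernym(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(shortest_path_distance("cookbook", "instruction_book"))
print(wup("cookbook", "instruction_book"))
```

In this toy tree both nodes sit at depth 12 and their shared parent at depth 11, so the score works out to 22/24, the same value the recipe reports.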
There's more...
Let's look at two dissimilar words to see what kind of score we get. We'll compare dog with cookbook, two seemingly very different words.
>>> dog = wordnet.synsets('dog')[0]
>>> dog.wup_similarity(cb)
0.38095238095238093
Wow, dog and cookbook are apparently 38% similar! This is because they share common hypernyms further up the tree:
>>> sorted(dog.common_hypernyms(cb))
[Synset('entity.n.01'), Synset('object.n.01'), Synset('physical_entity.n.01'), Synset('whole.n.02')]
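To see where a number like 38% can come from, here is a back-of-the-envelope check. It assumes the commonly cited Wu-Palmer formula, 2 * depth(lcs) / (depth(a) + depth(b)), where lcs is the deepest common hypernym and the root is counted at depth 1. If the four common hypernyms listed above form a single chain, the deepest of them is at depth 4. The individual depths of 9 and 12 used below are assumptions chosen to be consistent with the reported score; the helper name wup_from_depths is hypothetical.

```python
from fractions import Fraction


def wup_from_depths(lcs_depth, depth_a, depth_b):
    """Wu-Palmer style score from depths alone (root counted as depth 1)."""
    return Fraction(2 * lcs_depth, depth_a + depth_b)


# Four common hypernyms in a single chain put the deepest one at depth 4.
# The depths 9 and 12 are assumed for illustration, not read from WordNet.
score = wup_from_depths(lcs_depth=4, depth_a=9, depth_b=12)
print(score, float(score))
```

Under these assumptions the score is 8/21, so a shared ancestor only four levels deep is enough to produce a value well above zero.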
Comparing verbs
The previous comparisons were all between nouns, but the same can be done for verbs as well:
>>> cook = wordnet.synset('cook.v.01')
>>> bake = wordnet.synset('bake.v.02')
>>> cook.wup_similarity(bake)
0.6666666666666666
The previous Synsets were obviously handpicked for demonstration, and the reason is that the hypernym tree for verbs has a lot more breadth and a lot less depth. While most nouns can be traced up to the hypernym object, thereby providing a basis for similarity, many verbs do not share common hypernyms, making WordNet unable to calculate the similarity. For example, if you were to use the Synset for bake.v.01 in the previous code, instead of bake.v.02, the return value would be None. This is because the root hypernyms of both the Synsets are different, with no overlapping paths. For this reason, you also cannot calculate the similarity between words with different parts of speech.
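Because a similarity method may return None, calling code usually needs to handle that case before comparing or sorting scores. The helper below is a small sketch of one way to do that. The names safe_similarity and default are hypothetical, and FakeSynset is a stand-in with canned scores, used only so the example runs without WordNet installed; real Synset objects could be passed instead.

```python
def safe_similarity(a, b, default=0.0):
    """Return a.wup_similarity(b), substituting `default` when it is None."""
    score = a.wup_similarity(b)
    return default if score is None else score


class FakeSynset:
    """Hypothetical stand-in for a Synset; returns canned scores."""

    def __init__(self, name, scores):
        self.name = name
        self._scores = scores  # maps another synset's name to a score or None

    def wup_similarity(self, other):
        # Unknown pairs behave like "no common hypernym" and yield None.
        return self._scores.get(other.name)


cook = FakeSynset("cook.v.01", {"bake.v.02": 0.6666666666666666, "bake.v.01": None})
bake1 = FakeSynset("bake.v.01", {})
bake2 = FakeSynset("bake.v.02", {})

print(safe_similarity(cook, bake2))  # canned score is passed through
print(safe_similarity(cook, bake1))  # None is replaced by the default
```

Whether 0.0 is the right fallback depends on the application; in some cases it is better to skip the pair entirely than to treat "no score" as "not similar".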
Path and Leacock-Chodorow (LCH) similarity
Two other similarity comparisons are the path similarity and the LCH similarity, as shown in the following code:
>>> cb.path_similarity(ib)
0.3333333333333333
>>> cb.path_similarity(dog)
0.07142857142857142
>>> cb.lch_similarity(ib)
2.538973871058276
>>> cb.lch_similarity(dog)
0.9985288301111273
As you can see, the number ranges are very different for these scoring methods, which is why I prefer the wup_similarity method.
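The different ranges follow from the formulas behind the two scores. As a hedged sketch, assume path similarity is 1 / (distance + 1) and LCH similarity is -log((distance + 1) / (2 * max_depth)), where distance is the shortest path distance between the two Synsets and max_depth is the maximum depth of the taxonomy. The distance of 13 between cookbook and dog and the max_depth of 19 are assumptions, chosen because they reproduce the scores shown above; they are not read from WordNet here.

```python
import math


def path_similarity(distance):
    """Path score from a shortest path distance; always in (0, 1]."""
    return 1 / (distance + 1)


def lch_similarity(distance, max_depth):
    """LCH-style score; grows with taxonomy depth and is not capped at 1."""
    return -math.log((distance + 1) / (2 * max_depth))


NOUN_MAX_DEPTH = 19  # assumed value, picked to match the recipe's output

pairs = [("cookbook vs instruction_book", 2), ("cookbook vs dog", 13)]
for label, distance in pairs:
    print(label, path_similarity(distance), lch_similarity(distance, NOUN_MAX_DEPTH))
```

Under these assumptions the path score never exceeds 1, while the LCH score can, which is why the two are hard to compare directly.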
See also
The recipe on Looking up Synsets for a word in WordNet has more details about hypernyms and the hypernym tree.