Jon Aquino's Mental Garden

Engineering beautiful software

Tuesday, March 08, 2005

Do-It-Yourself Semantic Clustering of Tags using Google Directory

There has recently been a surge of articles about using the Porter Stemming Algorithm to find similar tags, that is, tags that share a common word stem. But how about clustering tags by similar meaning? That's a hard problem, you say.

Or is it? It turns out that Google Directory gives us hierarchical categories for any given keyword. For example, the word "coding" is categorized under Science/Math/Applications/Communication_Theory/Coding_Theory and Computers/Programming/Languages/Java/Coding_Standards.

So if we want to organize del.icio.us and Flickr tags semantically, that is, to cluster tags together by meaning, we can simply ask Google Directory for the categories of each tag.
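Here is a minimal sketch of the clustering step in Python, not the actual script. It assumes the category lookup has already happened and that the results sit in a plain dictionary: `CATEGORIES_BY_TAG`, `cluster_tags`, and the `depth` cutoff are all hypothetical names and choices of mine, and the "java" entry is made-up illustration data (only the "coding" paths come from the example above).

```python
from collections import defaultdict

# Hypothetical lookup results: tag -> category paths.
# The "coding" paths are the ones quoted in the post; the "java" entry
# is invented purely for illustration.
CATEGORIES_BY_TAG = {
    "coding": [
        "Science/Math/Applications/Communication_Theory/Coding_Theory",
        "Computers/Programming/Languages/Java/Coding_Standards",
    ],
    "java": [
        "Computers/Programming/Languages/Java",
    ],
}


def cluster_tags(categories_by_tag, depth=2):
    """Group tags under the first `depth` levels of each category path.

    Truncating the path is an assumption: it makes tags that sit in
    different leaves of the same branch land in the same cluster.
    """
    clusters = defaultdict(set)
    for tag, paths in categories_by_tag.items():
        for path in paths:
            key = "/".join(path.split("/")[:depth])
            clusters[key].add(tag)
    # Convert sets to sorted lists so the result is stable and printable.
    return {key: sorted(tags) for key, tags in clusters.items()}


clusters = cluster_tags(CATEGORIES_BY_TAG)
for category, tags in sorted(clusters.items()):
    print(category, "->", ", ".join(tags))
```

A tag can land in more than one cluster, which seems right for an ambiguous word like "coding". A smaller `depth` gives fewer, broader clusters; a larger one gives more, narrower ones.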

In fact, I have implemented this idea using screen scraping to get the information from Google Directory. (Hope you won't mind too much, Google!) I have written one script to extract the most frequently occurring tags for my del.icio.us links, and a second script to cluster those tags into the categories scraped from Google Directory. The end result is a list of categories representing my interests. Check it out!
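For the curious, the tag-counting half can be sketched in a few lines of Python. This is an illustration under assumptions, not the script itself: the `bookmarks` list and its layout are hypothetical sample data, and `most_frequent_tags` is a name I made up. Fetching the links from del.icio.us and scraping Google Directory are left out.

```python
from collections import Counter

# Hypothetical bookmark data: each link carries a list of tags.
bookmarks = [
    {"url": "http://example.com/a", "tags": ["coding", "java"]},
    {"url": "http://example.com/b", "tags": ["Coding", "design"]},
    {"url": "http://example.com/c", "tags": ["coding", "java", "tools"]},
]


def most_frequent_tags(bookmarks, limit=10):
    """Return up to `limit` (tag, count) pairs, most frequent first."""
    # Lowercasing is an assumption so that "Coding" and "coding" count together.
    counts = Counter(
        tag.lower() for bookmark in bookmarks for tag in bookmark["tags"]
    )
    return counts.most_common(limit)


top_tags = most_frequent_tags(bookmarks, limit=2)
print(top_tags)
```

Only the top tags then need to be looked up, which keeps the number of requests to the directory small.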

Update: A neat side effect of using Google Directory is that it even understands tags from other languages. For instance, someone tagged one of my links with "desenvolvimento". Now, I don't know what this word means, but Google Directory still gave me some clusters for it: World, Português/Regional/Brasil/Governo/Ministérios e Agências.
