Jon Aquino's Mental Garden

Engineering beautiful software jon aquino labs | personal blog

Tuesday, March 29, 2005

Mashing up Y!Q with Google Directory

What if we could use existing web services to perform automated categorization of blog entries? It turns out that we can achieve this using Y!Q and Google Directory -- the former to extract the most important words from a paragraph, and the latter to provide categories for these words.

There are two steps:

1. Extract keywords from each post using Yahoo's beta Y!Q service, which lets you search on one or more paragraphs of text (yes, as input).

2. Plug those keywords into Google Directory to find out what categories they belong to.
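
Here is a minimal sketch of those two steps in Python. The function names (`categorize_post`, `fake_extract`, `fake_lookup`) are made up for illustration, and the two services are replaced by offline stand-ins; real calls to Y!Q and Google Directory would be passed in instead.

```python
from typing import Callable, Dict, List, Optional


def categorize_post(
    text: str,
    extract_keywords: Callable[[str], List[str]],
    lookup_category: Callable[[str], Optional[str]],
) -> Dict[str, str]:
    """Map each keyword extracted from `text` to a directory category."""
    categories: Dict[str, str] = {}
    for keyword in extract_keywords(text):      # step 1: keyword extraction
        category = lookup_category(keyword)     # step 2: directory lookup
        if category is not None:                # skip keywords with no category
            categories[keyword] = category
    return categories


# Offline stand-ins for the two web services (assumptions, not real APIs).
def fake_extract(text: str) -> List[str]:
    return [kw for kw in ("esther dyson", "digerati") if kw in text.lower()]


FAKE_DIRECTORY = {
    "esther dyson": "Science/Social_Sciences/Political_Science/"
                    "Public_Policy/Ecommerce_Policy",
}


def fake_lookup(keyword: str) -> Optional[str]:
    return FAKE_DIRECTORY.get(keyword)


result = categorize_post(
    "Esther Dyson is The Pattern-Recognizer among the digerati.",
    fake_extract,
    fake_lookup,
)
print(result)
```

Passing the two services in as functions keeps the pipeline itself independent of how each service is reached (screen-scraping or an API).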

For example, take the following post which I wrote a few weeks ago:

My hope is that this Meetup.com group will be a means to find other Victoria residents who have a passion for cutting-edge technology. I found this great list of the world A-list digerati. Brewster Kahle is The Searcher, Kevin Kelly is the Saint, Bill Gates is The Software Developer, Esther Dyson is The Pattern-Recognizer, etc. Who are the digerati in beautiful, unassuming Victoria BC?

Y!Q gives me the following keywords. I'm guessing the numbers represent relevance scores, out of 100:

esther dyson:100 | digerati:99.553 | victoria bc:51.223 | brewster kahle:50.773 | kevin kelly:50.756 | cutting edge technology:50.252 | software developer:50.207 | bill gates:50.18 | cutting edge:50.053 | edge technology:49.751 | passion:49.059 | meetup:23.262 | pattern recognizer:1.487 | list:0.798 | saint bill:0.498 | searcher:0.498 | unassuming:0.498 | dyson:0.497 | residents:0.491 | victoria:0.01

Now let's look at the category that Google Directory gives me for "esther dyson":

Science/Social_Sciences/Political_Science/Public_Policy/Ecommerce_Policy

I would love to extract the Google Directory categories for all 24,300 keywords that Y!Q extracted from my blog, but unfortunately Google limits me to 1000 queries per day. So I'm just taking the highest-ranking ones for now.
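
Picking the highest-ranking keywords under a daily query budget could look something like the sketch below. It assumes the keywords arrive as one `keyword:score | keyword:score` string like the one shown above; `parse_keywords` and `top_keywords` are hypothetical helper names.

```python
from typing import List, Tuple


def parse_keywords(raw: str) -> List[Tuple[str, float]]:
    """Split 'keyword:score | keyword:score' text into (keyword, score) pairs."""
    pairs = []
    for item in raw.split("|"):
        item = item.strip()
        if not item:
            continue
        # Split on the last colon, in case a keyword itself contains one.
        keyword, _, score = item.rpartition(":")
        pairs.append((keyword.strip(), float(score)))
    return pairs


def top_keywords(pairs: List[Tuple[str, float]], budget: int) -> List[str]:
    """Return the `budget` highest-scoring keywords, best first."""
    ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
    return [keyword for keyword, _ in ranked[:budget]]


raw = "meetup:23.262 | esther dyson:100 | victoria bc:51.223 | digerati:99.553"
best = top_keywords(parse_keywords(raw), budget=2)
print(best)
```

With a budget of 1000, the same call would pick the keywords worth spending a day's queries on.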

That's the beauty of web services -- they can be combined in interesting ways. Web-service mashups are a meme that is currently building momentum.

Update: Here are some preliminary results of the automated analysis of my blog, showing my favourite interests, using the 1000 highest-rated keywords from Y!Q:

226 Computers
108 Arts
83 Society
73 Computers/Software
72 Business
51 Regional
48 Computers/Programming
46 Games
45 Society/Religion_and_Spirituality
41 Science
39 Society/Religion_and_Spirituality/Christianity
38 Shopping
38 Arts/Music
37 Computers/Internet
34 Reference
33 Business/Industries
32 Computers/Programming/Languages
30 World
30 Recreation
26 Computers/Data_Formats
25 Regional/North_America
22 Computers/Software/Internet
22 Computers/Data_Formats/Markup_Languages
20 Games/Card_Games
20 Computers/Data_Formats/Markup_Languages/HTML
19 Games/Card_Games/Special_Decks/Guillotine
19 Games/Card_Games/Special_Decks
18 Computers/Data_Formats/Markup_Languages/HTML/References
17 Reference/Education
17 Arts/Music/Styles
15 Society/Religion_and_Spirituality/Christianity/Denominations
15 Science/Social_Sciences
15 Regional/Europe
15 Computers/Programming/Languages/Java
14 Computers/Software/Internet/Clients
14 Arts/Movies
13 Regional/North_America/United_States
13 Arts/Literature
12 Society/Religion_and_Spirituality/Christianity/Denominations/Catholicism
12 Regional/North_America/Canada
12 Home
12 Business/Management
11 Regional/Europe/United_Kingdom
11 Computers/Internet/On_the_Web
10 Kids_and_Teens
10 Computers/Software/Operating_Systems
10 Computers/Open_Source
10 Business/Major_Companies/Publicly_Traded
10 Business/Major_Companies
10 Arts/Animation
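
In the list above, a parent category never has a smaller count than its children (for example, Games/Card_Games/Special_Decks and its Guillotine child both show 19). That suggests each category path is counted once for itself and once for every ancestor. This is my inference, not something confirmed; under that assumption, a tally could be sketched like this (`tally_categories` is a made-up name):

```python
from collections import Counter
from typing import Iterable


def tally_categories(paths: Iterable[str]) -> Counter:
    """Count each category path and every ancestor prefix of it."""
    counts: Counter = Counter()
    for path in paths:
        parts = path.split("/")
        # "A/B/C" contributes one count each to "A", "A/B" and "A/B/C".
        for depth in range(1, len(parts) + 1):
            counts["/".join(parts[:depth])] += 1
    return counts


paths = [
    "Computers/Software/Internet",
    "Computers/Programming/Languages/Java",
    "Games/Card_Games",
]
counts = tally_categories(paths)

# Print in the same "count category" layout, largest counts first.
for category, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(n, category)
```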

Another way we can browse this information is using a tree. Here I'm using Microsoft's free XML Notepad. The number of blog posts is given for each node:



The implication of this process is that you can analyze anyone's blog and "browse their brain-tree" to see if your interests are similar to theirs. Taking it a step further, you could probably find some statistical measure of how well your brain-tree intersects their brain-tree. Call it automated introspection (or automated extrospection?).
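
One possible measure of that intersection (my choice for illustration, not the only option) is a weighted Jaccard index over the category counts: the sum of the smaller count in each category divided by the sum of the larger. The sketch below uses a few of the counts from my results and an invented second blog; `brain_tree_overlap` is a hypothetical name.

```python
from typing import Dict


def brain_tree_overlap(a: Dict[str, int], b: Dict[str, int]) -> float:
    """Weighted Jaccard index: 0.0 means no overlap, 1.0 means identical."""
    categories = set(a) | set(b)
    shared = sum(min(a.get(c, 0), b.get(c, 0)) for c in categories)
    total = sum(max(a.get(c, 0), b.get(c, 0)) for c in categories)
    return shared / total if total else 0.0


mine = {"Computers": 226, "Arts": 108, "Games": 46}
theirs = {"Computers": 100, "Arts": 108, "Science": 41}  # invented example data

score = brain_tree_overlap(mine, theirs)
print(round(score, 3))
```

Because the counts depend on how much each person has written, normalizing each blog's counts to fractions before comparing might be fairer; the sketch leaves that out.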

From the above information I have hand-picked a few areas that especially interest me:
  • Arts/Graphic_Design/Typography
  • Computers/Human_Computer_Interaction
  • Computers/Internet/On_the_Web/Weblogs
  • Computers/Internet/Searching
  • Computers/Open_Source
  • Computers/Programming/Languages
  • Computers/Programming/Languages/Ruby
  • Computers/Programming/Languages/Smalltalk
  • Computers/Programming/Methodologies
  • Computers/Software/Databases/Data_Mining
  • Computers/Software/Editors
  • Computers/Software/Fonts
  • Computers/Software/Freeware
  • Computers/Software/Internet/Servers/Collaboration
  • Computers/Systems/Handhelds
  • Consumer_Electronics
  • Reference/Knowledge_Management/Knowledge_Discovery/Data_Mining
  • Reference/Knowledge_Management/Knowledge_Discovery/Text_Mining
  • Reference/Knowledge_Management/Knowledge_Discovery/Information_Visualization
  • Regional/North_America/Canada/British_Columbia/Localities/V/Victoria
  • Science/Math/Statistics
  • Science/Social_Sciences/Psychology/Industrial_and_Organizational/Human_Factors_and_Ergonomics
  • Science/Social_Sciences/Psychology/Self-Help
  • Society/Religion_and_Spirituality/Christianity/Denominations/Catholicism/Prayer_and_Spirituality
  • Society/Subcultures/Geeks_and_Nerds

Update, April 6: No need to screen-scrape Y!Q anymore -- they have released an official API for accessing their "term-extraction" service.
