Category: Recommendation Models

Predicting the World Series using Python

By Rathan Haran, October 29, 2009 7:45 am

Last week, I started learning Python through a peer-to-peer learning session set up through nextNY.  The material we’ve gone through has made programming very easy to wrap our heads around, and the environment of cooperative learning has been awesome.  I’m looking forward to being a Python ninja* pretty soon.

With four and a half chapters of Python at my disposal, I wanted to put my skills to the test.  Since I’m a huge baseball fan, I thought I’d try my hand at simulating who would lose the World Series this year, a pillow-fight match-up between the New York Yankees and the Philadelphia Phillies.

The first thing to do was to crunch the numbers.  Crunching the numbers means exactly that: figuring out the probabilities of events occurring over a seven-game series.  I incorporated things like Ryan Howard’s immense strike-out rate, Derek Jeter’s incredible lack of range at shortstop, and Brad Lidge’s ninth-inning ERA.  I also made sure to incorporate correlations, or how related the variables are to one another.  Funnily enough, the highest correlation I found was between having a runner on first base with fewer than two outs from the seventh inning onwards and A-Rod weakly grounding into a double play.  Numbers never lie.
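A minimal sketch of the simplest version of such a simulation, not the actual program described here.  It ignores every player-level stat and correlation and assumes each game is an independent coin flip with a fixed win probability.  The function names, the `p_win` value, the seed, and the trial count are all hypothetical, chosen only for illustration.

```python
import random


def simulate_series(p_win, games_to_win=4, rng=random):
    """Play one series; return True if team A wins it.

    p_win is the assumed chance that team A wins any single game.
    """
    wins_a = wins_b = 0
    while wins_a < games_to_win and wins_b < games_to_win:
        if rng.random() < p_win:
            wins_a += 1
        else:
            wins_b += 1
    return wins_a == games_to_win


def estimate_series_win_probability(p_win, trials=100_000, seed=2009):
    """Monte Carlo estimate of team A's chance of winning the series."""
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    series_won = sum(simulate_series(p_win, rng=rng) for _ in range(trials))
    return series_won / trials


if __name__ == "__main__":
    # Assumption: team A wins 55% of individual games.
    print(estimate_series_win_probability(0.55))
```

Under these assumptions a 55% per-game edge becomes roughly a 61% chance of taking a best-of-seven series (the exact value is about 0.608), which is why small per-game differences matter more over a series than in a single game.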

Now this got me a pretty good picture of who would lose the World Series, but I hadn’t taken into consideration the qualitative variables, the intangibles, the “Cole Hamels is a playoff pitcher” and the “Mariano is unhittable in the World Series” bullshit.  These are usually the ‘statistics’ that overzealous fans throw out (with no meaningful data except their distorted memories) in defense of a player’s immortality.

The classic intangible lies on the shoulders of the Yankee captain, Derek Jeter, a ball player who seems to find himself in the right place at the right time in the postseason.  Yankee fans constantly spout his ‘greatness’ and refuse to admit that he was horribly out of position on the Jeremy Giambi play at the plate, and that he doesn’t even register as having the highest batting average in a World Series (that designation goes to Billy Hatcher, who hit a sickening .750 for the Reds in 1990 in 12 ABs).  Heck, Jeter doesn’t even deserve the nickname “Mr. November” for his play in the 2001 World Series.  He had 1 HR, 1 RBI, and 2 runs scored in November, numbers that were almost matched by a pitcher for the Arizona Diamondbacks (1 RBI and 2 runs scored).  Oh, and that pitcher also won two potentially series-ending games in two days that November with a 2.22 ERA, .96 WHIP, and 8 Ks in 8.1 innings.  Derek Jeter, I’d like you to meet the real “Mr. November,” Randy Johnson.

Okay, so I wrote my little Python program to capture all of this.  The stats, the pseudo-stats, the Phillie Phanatic’s rants, and the countless times we’ll hear “26 World Series rings.”  With so many probabilities and interactions, this program chugged along for two days, and finally, yesterday before the first pitch, I got the result:  ValueError: Let’s Go Mets.

*Looking forward to the day when ninja is not used in start-up world employment searches and reverts back to its original awesomeness of stealthy nighttime assassin.

Rediscover Discovery

By Rathan Haran, August 10, 2009 10:07 am

Discovery seems to be one of those things that we don’t really spend a lot of time doing, but when it happens, it’s like our world could not exist as it once did. Whether it’s one of the great discoveries of mankind (oh snap, the world is round) or a small discovery made by each of us every day (Baoguette, a Vietnamese sandwich shop on 25th and Lexington), the result is the same … totally mind-changing.

How we find things on the web, or rather what we think is the best way to find things on the web, has changed dramatically.  The initial thought was to build web portals, where anything and everything you wanted to know would be conveniently located on a single web destination.  Our one-stop shop for discovery.  That quickly changed when consumers were not able to find the best information in one place, and instead scoured the web for sources they personally found valuable.  As content became easier to publish, resident experts started making places on the web to get very specialized information, and we needed a new way to find things.

Hello Yahoo, I’d like to introduce you to Google.  Google: “We’ve got this great algorithm that scours the entire web and returns the best result. You can buy us for $1M.”  Yahoo: “Well that’s silly, why would we want people to leave our site?  We want people to stay on our site for as long as possible.  Carry on now, nothing to see here.”  (I’m still shocked that Yahoo would pass on this; if not for the not-so-obvious revenue model, then just to use Google’s search to discover great content to add to their own portal!)

So, spurned by Yahoo, Google acquires a small company called Applied Semantics, the maker of AdSense, and the rest is somewhat recent history.  Google really capitalized by making a better way for users to find things.  They did a better job of serving up what people were looking for (PageRank) and linked those results with the people providing what users were searching for.  Awesome, so now people can find out where to get the things they are looking for.

Often discovery is more than just finding where to get what you are looking for.  Discovery is often a question of “what do I want” versus “where do I get it.”  Search does a great job of solving the “where” question (Where is Baoguette located?), but it’s not so great for the “I could really go for something sweet & spicy, but not served in a sauce with rice” question.  To solve that, we look to the experiences of others, and we count on their recommendations to discover new things.  Search will always have its place, but taking good ol’ word of mouth and placing it in a useful platform on the web will open up all kinds of new doors of discovery for everyone.

Recommendation models have come a long way, but there is still work to be done.  Netflix recently ended their competition to improve their recommendation algorithm and will pay the winning team $1M.  There is a huge amount of value in this space, and companies are recognizing that more and more each day.  How do we capture that on the web?  A difficult problem to solve, but one with great rewards.

I’ll leave you with a quote from Greg Linden, the man behind Amazon’s recommendation system, that pretty much sums it up.  He said, “Whoever manages to change the nature of content display on the Web from a search problem to a recommender problem will reap tremendous rewards.”  It makes a lot of sense when you think about where you got your last great restaurant recommendation.

Recommendations on Facebook?

By Rathan Haran, May 17, 2009 12:51 pm

Has Facebook finally gotten it?  With the amount of personal data and interactions that they are able to collect, it should be a no-brainer for them to work on a recommendation framework.  With the advent of user-generated content, we are in information overload at this point (just look at how many unread emails you have, or how many article feeds you end up ignoring).  The next big move needs to be a way to sort through all of this information and bubble up the things that we care about as individuals.

Facebook is in a great place to do this because people see a ton of value in declaring a lot about themselves on their platform (maybe a bit too much when you are “no longer in a relationship” – broken heart).  The popular a/s/l (age, sex, location) back in the day in AOL chat rooms was one of the first forms of public declaration on the web, and this basic desire/need to share information about ourselves has not changed as the web has evolved.  Facebook has created a place where people want to do this, and that could be the most important and valuable thing they have to offer.

I recently met with some folks at Yahoo and learned a particularly interesting thing about whether users input accurate information.  There are tens of thousands of registered users at Yahoo that have the zip code 12345.  Now the funny thing about this zip code is that it covers a tiny area in Schenectady, New York, where there are 10 registered businesses and way fewer than 10,000 inhabitants!  A Yahoo user doesn’t care to share where they live because there is no good, compelling reason to do it, while a Facebook user wants to share that information because it is valuable to their social network.  I wonder how many people on Yahoo live in 90210.  (On a side note, I found this information on Wolfram Alpha, which is a statistical gold mine for anyone who loves numbers.)

Since Facebook has already gotten people to buy into sharing their information, they would have to be crazy not to work with this information to provide better services, including personalized recommendations.  But alas, it appears that this feature is just a way to condense duplicate posts so users can’t spam their friends’ news feeds.  Another tease.  It’s pretty sad to say that the best analysis of online social interactions, and the most entertaining, has been Dateline NBC’s To Catch a Predator.  Yeah, I’m sure you were just there to help with some homework.

Pandora’s (Beat)Box

By Rathan Haran, May 13, 2009 12:04 pm

The story of Pandora and her box goes a little something like this.  Zeus gets pretty pissed off at this guy Prometheus for throwing fire in a game of RoShamBo, and in retaliation creates this woman Pandora to punish all of mankind (OK, I’m speculating about the game of Rock-Paper-Scissors, but I’d get pretty upset if I got beaten by someone using their once-in-a-lifetime throw of fire).  Pandora is given many seductive gifts from the gods, and one in particular, the gift of curiosity, leads her to open a box releasing all the awfulness into the world (including the credit crisis).  Realizing what she has just done, Pandora quickly slams the box shut, trapping only Hope inside … or maybe not.  After using Pandora.com, it is easy to see why President Obama sees so much hope in the world.

Pandora is an online, streaming music player where users can “customize” their own radio stations.  It’s absolutely brilliant since the only thing you really have to do is enter a song or artist, and Pandora will automatically stream music that is similar to that song or artist.  Now, instead of spending hours customizing the perfect 80s party playlist (or mix tape/CD for the romantics out there), we can just tell Pandora what song fits our mood at the time and get hours of music delivered to us.  Best of all, if a song comes on and we’re not sure why it’s there in the first place (Blame it on the Rain made it on all my mix tapes for some reason), we can easily skip it, and Pandora will exclude songs like it.

So how does Pandora do this?  Well, they’ve hired a team of music analysts who essentially measure each song on 100+ musical characteristics, an effort they call the Music Genome Project.  These characteristics, or metrics, make up the “genes” of a song, and their measurements are used to construct a song vector, a mathematical attempt to capture the essence of a song.  The similarity of two songs is figured out by measuring the differences between all the musical characteristics of the two songs.  To do this well, Pandora uses a complex distance function, which is essentially asking “how far apart, or different, are these two songs?”  The shorter the distance, the more similar the songs are, and the more likely that song will be played next in your Pandora station.
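To make the song vector idea concrete, here is a minimal sketch.  It is not Pandora’s actual method: their distance function is described only as “complex,” so plain Euclidean distance stands in for it here, and the song names, the three “genes,” and their scores are all made up for illustration.

```python
import math

# Hypothetical catalog: each song is scored from 0 to 1 on three invented
# "genes" (say tempo, vocal intensity, acoustic-ness).
songs = {
    "Song A": [0.9, 0.2, 0.7],
    "Song B": [0.8, 0.3, 0.6],
    "Song C": [0.1, 0.9, 0.2],
}


def distance(u, v):
    """Euclidean distance between two song vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def most_similar(seed, catalog):
    """Return the name of the song closest to the seed song."""
    seed_vector = catalog[seed]
    others = (name for name in catalog if name != seed)
    return min(others, key=lambda name: distance(seed_vector, catalog[name]))


if __name__ == "__main__":
    print(most_similar("Song A", songs))  # Song B
```

Song B differs from Song A by only 0.1 on each trait, while Song C is far away on all three, so Song B would be the next track queued up in this toy station.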

This is a very powerful framework, but there is one important assumption that shouldn’t be overlooked, and it could be a major drawback to implementing this particular recommendation engine.  That critical assumption is that we have identified every factor that captures the je ne sais quoi of a song, which for the non-French-speaking means an intangible quality that makes something distinctive.  Do you smell the conundrum brewing?  How does one measure the intangible?  Can you find all the right factors to accurately describe Kris Allen’s performance of Kanye West’s Heartless?  Now while it might be next to impossible to figure out everything that makes a song click, it is very important that you catch the most influential factors in your recommendation model.  Failure to do this could get you voted off.

Pandora is doing a pretty damn good job recommending songs using this framework, and they understand that there are a lot of factors that make a song a unique piece of work.  They have identified a lot of the measurable, tangible metrics and have used them to effectively relate songs to one another.  The next big step in recommendation models would be to understand how each individual values a song, what aspects are more important on a case-by-case basis, and eventually deliver a personalized Rathan and Rathan’s Infinite Playlist just for me.
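One way to picture that next step, purely as a sketch and not anything Pandora is known to do: give each listener their own weights over the song traits, so the same two songs can be “close” for one person and “far” for another.  The two traits, the weight values, and the function name below are all hypothetical.

```python
import math


def weighted_distance(u, v, weights):
    """Euclidean distance where each trait counts according to a listener's weight."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(u, v, weights)))


# Invented traits, in order: (tempo, vocal style), each scored from 0 to 1.
seed_song = [0.8, 0.2]
same_tempo_song = [0.8, 0.9]   # matches the seed on tempo only
same_vocals_song = [0.3, 0.2]  # matches the seed on vocal style only

# Hypothetical listeners: one cares mostly about tempo, the other about vocals.
tempo_lover = [1.0, 0.1]
vocals_lover = [0.1, 1.0]

if __name__ == "__main__":
    for label, weights in (("tempo lover", tempo_lover), ("vocals lover", vocals_lover)):
        d_tempo = weighted_distance(seed_song, same_tempo_song, weights)
        d_vocals = weighted_distance(seed_song, same_vocals_song, weights)
        pick = "same-tempo song" if d_tempo < d_vocals else "same-vocals song"
        print(label, "->", pick)
```

With these made-up numbers the tempo lover gets the same-tempo song and the vocals lover gets the same-vocals song, even though both started from the same seed.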
