Posts tagged: Modeling

Predicting the World Series using Python

By , October 29, 2009 7:45 am

Last week, I started learning Python through a peer-to-peer learning session set up through nextNY.  The material we’ve gone through has made programming easy to wrap our heads around, and the environment of cooperative learning has been awesome.  I’m looking forward to being a Python ninja* pretty soon.

With four and a half chapters of Python at my disposal, I wanted to put my skills to the test.  Since I’m a huge baseball fan, I thought I’d try my hand at simulating who would lose the World Series this year, a pillow-fight match-up between the New York Yankees and the Philadelphia Phillies.

The first thing to do was to crunch the numbers.  Crunching the numbers means exactly that: figuring out the probabilities of events occurring over a seven-game series.  I incorporated things like Ryan Howard’s immense strikeout rate, Derek Jeter’s incredible lack of range at shortstop, and Brad Lidge’s ninth-inning ERA.  I also made sure to incorporate correlations, or how strongly the variables are related to one another.  Funnily enough, the highest correlation I found was between having a runner on first base with fewer than two outs from the seventh inning onwards and A-Rod weakly grounding into a double play.  Numbers never lie.
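For the curious, here is a minimal sketch of the simplest version of that number crunching.  It is not the program described in this post: it assumes every game is an independent coin flip with one fixed win probability (the hypothetical `p_win_game`), and it ignores all the player stats and correlations mentioned above.  The function names, trial count, and seed are made up for illustration.

```python
import random


def simulate_series(p_win_game, rng, wins_needed=4):
    """Play one best-of-seven series; return True if team A wins it."""
    wins_a = wins_b = 0
    # The series ends as soon as either team reaches four wins.
    while wins_a < wins_needed and wins_b < wins_needed:
        if rng.random() < p_win_game:
            wins_a += 1
        else:
            wins_b += 1
    return wins_a == wins_needed


def series_win_probability(p_win_game, trials=100_000, seed=2009):
    """Estimate team A's chance of winning the series by repeated simulation."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    series_won = sum(simulate_series(p_win_game, rng) for _ in range(trials))
    return series_won / trials
```

Calling `series_win_probability(0.55)` shows how a small per-game edge grows over a series: under these assumptions, a 55% chance per game works out to roughly a 61% chance of taking the whole thing.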

Now this got me a pretty good picture of who would lose the World Series, but I hadn’t taken into consideration the qualitative variables, the intangibles, the “Cole Hamels is a playoff pitcher” and the “Mariano is unhittable in the World Series” bullshit.  These are usually the ‘statistics’ that overzealous fans throw out (with no meaningful data except their distorted memories) in defense of a player’s immortality.

The classic intangible lies on the shoulders of the Yankee captain, Derek Jeter, a ballplayer who seems to find himself in the right place at the right time in the postseason.  Yankee fans constantly spout his ‘greatness’ and refuse to admit that he was horribly out of position on the Jeremy Giambi play at the plate, and that he doesn’t even register as having the highest batting average in a World Series (that designation goes to Billy Hatcher, who hit a sickening .750 for the Reds in 1990 in 12 ABs).  Heck, Jeter doesn’t even deserve the nickname “Mr. November” for his play in the 2001 World Series.  He had 1 HR, 1 RBI, and 2 runs scored in November, numbers that were almost matched by a pitcher for the Arizona Diamondbacks (1 RBI and 2 runs scored).  Oh, and that pitcher also won two potentially series-ending games in two days that November with a 2.22 ERA, a 0.96 WHIP, and 8 Ks in 8.1 innings.  Derek Jeter, I’d like you to meet the real “Mr. November,” Randy Johnson.

Okay, so I wrote my little Python program to capture all of this: the stats, the pseudo-stats, the Phillie Phanatic’s rants, and the countless times we’ll hear “26 World Series rings.”  With so many probabilities and interactions, this program chugged along for two days, and finally, yesterday before the first pitch, I got the result:  ValueError: Let’s Go Mets.

*Looking forward to the day when ninja is no longer used in start-up world employment searches and reverts to its original awesomeness of stealthy nighttime assassin.

SmackDown Headliner – Google VS Facebook

By , June 23, 2009 12:26 pm

Me at 7, with bigger guns

I haven’t watched WWF, or WWE, or Friday Night SmackDown since I was a kid (see right), but after reading Wired magazine’s article on Google vs. Facebook, I could not help but think about what is, in my opinion, the greatest wrestling match of all time.  This battle pitted the up-and-coming, wildly popular, eccentric and electric young superstar against the stalwart, power-punching, mega-myth champion of the world.  Of course, I’m talking about the headliner at WrestleMania 6, where the Heavyweight Champion of the World, Hulk Hogan, fought the Intercontinental Champ, The Ulllttiiimmmatteeeee Warrrrrioorrrrrrr!

Champion against champion, title for title, that’s what it’s all about.

Google and Facebook are waging their own war over what the Internet’s future will look like.  They both have an underlying mission to share information, but their core approaches and visions of the web are very different.  Google has historically viewed the web as the great equalizer, the place where information can be accessed by anyone and everyone, and where that information can be efficiently found by harnessing the power of cold, hard algorithms.  Facebook sees the web not as the source of information per se, but rather as the medium through which people can share information across their social net.  Instead of relying on complex math, Facebook puts the power of human sharing at the forefront of spreading information.

Both of these approaches have their place on the web.  What good is a platform to share information easily from the people that matter most if the people that matter the most can’t find the information in the first place, and vice versa?  In my mind, the bigger challenges lie in front of Facebook, because the future of sourcing information from hundreds of friends (if not thousands for the Facebook junkies “power users”) will come down to powerful ranking, grouping, sorting, and prioritizing algorithms, a space that Google has done very well in.

“So wha’cha gonna do brother … when the Hulkster (read as Google) comes for youuuu (read as Facebook)!”  Well, Facebook has been able to pull some ex-Googlers into their shop, to the tune of nearly 9% of their staff, and they have a virtual lock on the social network space (although I begin to worry about the hipness of it when my parents’ generation is “friending” me).  As difficult as it may seem, they may be putting together the pieces and the relationships to really challenge Google’s web dominance.  And maybe, just maybe, they’ll have enough to gorilla slam the powerhouse, avoid the leg-drop, and big splash their way to the top, just like the greatest character wrestler of all time was able to do.  R.I.P. The Ultimate Warrior.

Bonus Footage:  Top Ultimate Warrior Promos Ever


Pandora’s (Beat)Box

By , May 13, 2009 12:04 pm

The story of Pandora and her box goes a little something like this.  Zeus gets pretty pissed off at this guy Prometheus for throwing fire in a game of RoShamBo, and in retaliation creates this woman Pandora to punish all of mankind (OK, I’m speculating about the game of Rock-Paper-Scissors, but I’d get pretty upset if I got beat by someone using their once-in-a-lifetime throw of fire).  Pandora is given many seductive gifts from the gods, and one in particular, the gift of curiosity, leads her to open a box, releasing all the awfulness into the world (including the credit crisis).  Realizing what she has just done, Pandora quickly slams the box shut, trapping only Hope inside … or maybe not.  After using Pandora.com, it is easy to see why President Obama sees so much hope in the world.

Pandora is an online, streaming music player where users can “customize” their own radio stations.  It’s absolutely brilliant since the only thing you really have to do is enter a song or artist, and Pandora will automatically stream music that is similar to that song or artist.  Now, instead of spending hours customizing the perfect 80s party playlist (or mix tape/CD for the romantics out there), we can just tell Pandora what song fits our mood at the time and get hours of music delivered to us.  Best of all, if a song comes on and we’re not sure why it is there in the first place (Blame it on the Rain made it on all my mix tapes for some reason), we can easily skip it, and Pandora will exclude songs like it.

So how does Pandora do this?  Well, they’ve hired a team of music analysts who essentially measure each song on 100+ musical characteristics, an idea inspired by the Music Genome Project.  These characteristics, or metrics, make up the “genes” of a song, and their measurements are used to construct a song vector, a mathematical attempt to value the essence of a song.  The similarity of two songs is figured out by measuring the differences between all the musical characteristics of the two songs.  To do this well, Pandora uses a complex distance function, which essentially asks, “how far apart, or different, are these two songs?”  The shorter the distance, the more similar the songs are, and the more likely that song will be played next in your Pandora station.
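To make the song vector idea concrete, here is a minimal sketch.  Pandora’s actual distance function isn’t public as far as I know, so this swaps in plain Euclidean distance as a stand-in, and the three-trait vectors and song names are invented for illustration (real song vectors would have 100+ entries).

```python
import math

# Hypothetical song vectors: each number scores one musical trait on a 0-1 scale.
songs = {
    "song_a": [0.9, 0.2, 0.7],
    "song_b": [0.8, 0.3, 0.6],
    "song_c": [0.1, 0.9, 0.2],
}


def distance(u, v):
    """Euclidean distance between two equal-length song vectors."""
    if len(u) != len(v):
        raise ValueError("song vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def most_similar(seed_name, catalog):
    """Return the other songs ordered from closest to farthest from the seed."""
    seed = catalog[seed_name]
    others = [name for name in catalog if name != seed_name]
    return sorted(others, key=lambda name: distance(seed, catalog[name]))
```

Here `most_similar("song_a", songs)` puts `song_b` ahead of `song_c`, because its trait scores sit much closer to the seed song’s.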

This is a very powerful framework, but there is one important assumption that shouldn’t be overlooked, and it could be a major drawback to implementing this particular recommendation engine.  That critical assumption is that we have identified every factor that captures the je ne sais quoi of a song, which for the non-French speaking means an intangible quality that makes something distinctive.  Do you smell the conundrum brewing?  How does one measure the intangible?  Can you find all the right factors to accurately describe Kris Allen’s performance of Kanye West’s Heartless?  Now while it might be next to impossible to figure out everything that makes a song click, it is very important that you catch the most influential factors in your recommendation model.  Failure to do this could get you voted off.

Pandora is doing a pretty damn good job recommending songs using this framework, and they understand that there are a lot of factors that make a song a unique piece of work.  They have developed a framework where they have identified a lot of the measurable, tangible metrics, and have used them to effectively relate songs to one another.  The next big step in recommendation models would be to understand how each individual values a song and what aspects are more important on a case-by-case basis, and eventually to deliver a personalized Rathan and Rathan’s Infinite Playlist just for me.
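One way to picture that next step, purely as a sketch and not anything Pandora has said it does: give each listener their own set of weights, so that differences in the traits they care about count for more.  The trait names, vectors, and weights below are all hypothetical.

```python
import math


def weighted_distance(u, v, weights):
    """Euclidean distance with each trait's squared difference scaled by a listener weight."""
    if not (len(u) == len(v) == len(weights)):
        raise ValueError("vectors and weights must have the same length")
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(u, v, weights)))


# Hypothetical traits, in order: [tempo, vocals, distortion]
seed = [0.9, 0.2, 0.7]
candidate_1 = [0.9, 0.8, 0.7]  # matches the seed on tempo, differs on vocals
candidate_2 = [0.3, 0.2, 0.7]  # matches the seed on vocals, differs on tempo

# Made-up listener profiles: a larger weight means the trait matters more.
tempo_fan = [1.0, 0.1, 0.1]
vocals_fan = [0.1, 1.0, 0.1]
```

With these numbers, the tempo fan’s station would lean toward `candidate_1` and the vocals fan’s toward `candidate_2`, even though both started from the same seed song.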

Whatchu talkin’ about Warren

By , May 8, 2009 1:19 pm

Berkshire Hathaway had their annual shareholders’ meeting last Saturday (May 2), and Warren Buffett and Charlie Munger totally hated on “higher-order” mathematics used in finance.  Come on guys, what did little ol’ math do to you?  Math and modern portfolio theory were picked on by these investment gurus more than Arnold was picked on by the Gooch!  Don’t worry math, I got your back.

The truth of the matter is that while Mr. Buffett and Mr. Munger are right about Wall Street’s reliance on complex math, the real blame should be focused on the consultants and investment managers who hawked these models as the end-all, be-all, best thing since sliced bread.  This is one case where it is totally fine to shoot the messenger in the face.  However, we shouldn’t abandon using math to help us make better decisions.  We just need to find a better translator, because the message has some very valuable insights.

The reason why we build financial models, or really any models, is to keep track of numerous and complex relationships, something that is very difficult to do in our heads.  The world does not move in simple, predictable ways, and the real value in modeling frameworks is to find the best representation of how the world actually behaves.  Sometimes a simple relationship just doesn’t make sense; Mr. Buffett would surely agree that modeling investment growth as a simple linear change is not nearly as good as modeling it as an exponential change (there are a number of high school curricula that consider this “higher math”).
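The gap between those two models is easy to see with a few lines of Python.  This is just an illustrative sketch: the function names are mine, and the principal, rate, and horizon in the note below are made-up numbers, not anyone’s actual returns.

```python
def linear_growth(principal, rate, years):
    """Add the same dollar amount every year (simple interest)."""
    return principal * (1 + rate * years)


def exponential_growth(principal, rate, years):
    """Reinvest the gains every year (compound interest)."""
    return principal * (1 + rate) ** years
```

Take $1,000 at 8% a year for 30 years: the linear model ends at about $3,400, while the exponential model ends at about $10,063.  Same inputs, very different picture of the world.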

The key is to fully understand and make transparent that as we increase complexities in models, we increase the number of things that can go wrong, and therefore decrease our certainty of performance.  Think back to our first calculator, which for a lot of us often doubled as our first watch (wicked).  Simple, easy, and reliable.  Now add in a 2.66GHz Intel Processor, 8GB RAM, 320GB of Storage, and a super-fly, aluminum-cased, glow-in-the-dark keyboard.  We have a kick-ass laptop that lets us do all sorts of things a whole lot better, but it’s not surprising that its average lifespan is somewhere around 2 – 4 years.  And when it goes, we lose everything (yes, even that awesome illegally downloaded music collection that was the envy of our less tech-savvy and risk-averse friends).  The funny thing is that Casio can still multiply two five-digit numbers, even after 20+ years!  But that doesn’t make it better.

Unfortunately, the certainty of performance only really bothers us in the worst of times, like when our computers crash and the stock market collapses.  Now, just like backing up our hard drives, there are ways that we can create more security around financial modeling.  A few things that come to mind are good stress testing frameworks (if your models can’t do this easily for you, then be very cautious with their results), putting good translators (i.e., people who get how the model works AND understand its limitations) in front of decision makers early and often, and moving to a risk-based incentive compensation model (a discussion for another time).
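As a rough idea of what “easy stress testing” could look like, here is a toy sketch.  It assumes the simplest possible model (portfolio value is just quantity times price) and a handful of made-up shock scenarios; a real framework would shock the model’s actual inputs, and all the names and numbers here are hypothetical.

```python
def portfolio_value(holdings, prices):
    """Total value of the portfolio: quantity held times price, summed over assets."""
    return sum(qty * prices[asset] for asset, qty in holdings.items())


def stress_test(holdings, prices, scenarios):
    """Return the portfolio's fractional change in value under each shock scenario."""
    base = portfolio_value(holdings, prices)
    results = {}
    for name, shocks in scenarios.items():
        # Assets not mentioned in a scenario keep their current price.
        shocked = {asset: price * (1 + shocks.get(asset, 0.0))
                   for asset, price in prices.items()}
        results[name] = (portfolio_value(holdings, shocked) - base) / base
    return results


# Made-up portfolio and scenarios, for illustration only.
holdings = {"stocks": 100, "bonds": 50}
prices = {"stocks": 50.0, "bonds": 100.0}
scenarios = {
    "equity crash": {"stocks": -0.40},
    "everything falls": {"stocks": -0.40, "bonds": -0.10},
}
```

Running `stress_test(holdings, prices, scenarios)` on this portfolio shows a loss of about 20% in the first scenario and about 25% in the second.  If a model can’t answer that kind of “what if” question quickly, that is a warning sign in itself.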

Modeling frameworks are very useful, but they shouldn’t be used as a reason to stop thinking about what we are doing.  The human element in analyzing data can never be replaced by a pure modeling framework.  We shouldn’t cite blatant disregard of rational thought by high-paid consultants and star investment analysts as failures in mathematical modeling.  Because remember, when you point your finger at your model, there are three fingers pointing back at you … wait for it … wait for it … okay, you got it, cool.
