Predictive Modeling: Why the “who” is just as important as the “how”

by Jeong-Yoon Lee

There is significant debate in the data science community about the most important ingredients for getting accurate results from predictive models. Some claim it’s all about the quality and quantity of the data: you need a data set of a certain size (typically large) and a particular quality (typically very good) to get meaningful outputs. Others focus on the models themselves, debating the merits of different single models – deep learning, gradient boosting machines, Gaussian processes, and so on – versus combined approaches such as ensemble methods.
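To make the single-model-versus-ensemble contrast concrete, here is a minimal sketch in Python using scikit-learn; the library, the synthetic data set, and the particular base models are my illustrative assumptions, not anything from the debates above. It cross-validates one gradient boosting model against a soft-voting ensemble that averages the probability estimates of two different learners.

# Minimal sketch: one single model vs. a simple ensemble.
# scikit-learn, the synthetic data, and the model choices are
# illustrative assumptions only.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

single = GradientBoostingClassifier(random_state=0)

# "Combined approach": soft voting averages the predicted class
# probabilities of several different base models.
ensemble = VotingClassifier(
    estimators=[
        ("gbm", GradientBoostingClassifier(random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",
)

for name, model in [("single GBM", single), ("ensemble", ensemble)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")

On real problems the ensemble often, but not always, edges out its best single member; how much depends heavily on the choices the modeler makes, which is the point of this article.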

Both of these positions hold some truth. It isn’t as simple as “more data, better results” (cases from Twitter and Netflix show that sheer data volume contributed almost nothing to predictive accuracy), nor is the choice of model the sole predictor of success, but each of these elements does play a role in how accurate the results will be. There is another factor, however, that is almost always overlooked: the modelers themselves. Like the models they create, not all data scientists are created equal. I am less interested in who is “smarter” or better educated, and more in how competitive and dedicated the modeler is. Most marketers don’t question the qualifications of a data science team; they assume that given good data and a solid algorithmic approach, any team will achieve good predictive performance, or at the very least that performance across different modelers will be comparable. Unfortunately, that’s not always the case.
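The diminishing-returns-on-data claim is also something a modeler can check directly rather than take on faith. The sketch below, again with scikit-learn and synthetic data as assumed stand-ins, fits a model on increasing fractions of a data set and prints the learning curve; a plateau is the signature of data volume ceasing to matter.

# Hedged sketch: does more data keep helping? Plot a learning curve
# and look for a plateau. Library and data set are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

sizes, _, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy",
)

for n, score in zip(sizes, test_scores.mean(axis=1)):
    print(f"{n:>5} training examples -> {score:.3f} accuracy")
# Past a point, extra data often buys little extra accuracy; from
# there on, the modeler's choices matter more than raw volume.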

In his New York Times bestseller Superforecasting, Prof. Philip Tetlock of the University of Pennsylvania showed that, in the forecasting tournament run by the Intelligence Advanced Research Projects Activity (IARPA), the performance of “superforecasters” was 50% better than the standard benchmark, and 30% better than analysts with access to classified data. This clearly demonstrates that the people doing the modeling, not the data or the models themselves, make a huge difference.

More relevant to predictive modeling specifically: KDD, one of the most prestigious data science conferences, has hosted an annual predictive modeling competition, the KDD Cup, since 1997. It attracts participants from top universities and companies around the world. Although every team is given exactly the same data set and has access to the same state-of-the-art algorithms, the resulting performance varies wildly across teams. Last year, the winning team achieved 91% accuracy, while over 100 teams remained below 63% accuracy, roughly 30% below the best score.

Both of these examples show the importance of not just the “how,” but the “who,” when it comes to predictive modeling. This isn’t always the easiest thing for marketers to assess, but it should definitely be taken into consideration when evaluating predictive analytics solutions. Ask about the data, the models, and the methodology, but don’t forget the modelers themselves. The right data scientists can make all the difference to the success of your predictive program.