Quora: How many employed data scientists are able to solve problems from online competitions such as Kaggle’s?

by Jeong-Yoon Lee

Before reading further, please watch this video (only 1m 47s long), which shows how an average man compares to a football player in the 40-yard dash.

When I talk to data science professionals, especially senior ones with more experience, I often encounter overconfidence about their own competitiveness: “I know what I am doing and can build good models at work, maybe better than others.”

Online competitions provide objective measures for at least a few criteria, such as prediction accuracy, time to build a good model, reproducibility, etc.

For most data scientists, including myself, working on competitions is a reality check and a humbling experience:

At the Intelligence Advanced Research Projects Activity (IARPA) tournament, the performance of “superforecasters” was 50% better than other forecasters, and 30% better than even those with access to secret data ¹.
At KDD Cup 2015, the winning teams achieved over 90% accuracy while over 100 teams remained around 60% accuracy, 30 percentage points below the best score ².
At the Criteo Display Advertising Challenge, the benchmark solution provided by a well-respected domain expert was outperformed by 100+ lines of simple Python code written by a Kaggle user, tinrtgu.

Long tenure doesn’t guarantee superior performance. As Dr. Ericsson summarizes in his bestselling book, Peak, the doctor, teacher, or driver with twenty years of experience is likely to be worse than the one with only five, because performance deteriorates gradually over years of routine, automated work in the absence of deliberate efforts to improve.

Going back to the original question, employed data scientists “without learning from competitions” are likely to do very poorly on competitions.

The learning doesn’t need to come from participating in competitions. Out of 1MM+ Kaggle users, only 65K+ participate in competitions, while the others learn cutting-edge algorithms and best practices from tutorials, from solutions shared by others, from working on open data sets, and so on.

Whenever I talk to someone who discounts the benefits of competitions without having entered a single one, and yet is very confident in her/his modeling capability, I can’t help thinking about the average-man-vs-football-player video above, and just smile. :)

Competing over a 0.1% improvement in accuracy? Criticizing that is like criticizing Olympic 100m sprinters for competing over 0.1 seconds. That’s not for most of us. Don’t worry about it until you get close. We have a much longer way to go.


Footnotes