Kaggler. Data Scientist. Father of Five.
In 2014, I shared the specifications of a 6-core, 64GB RAM desktop system that I purchased for around $2,000. Since then, I have added an NVIDIA Titan X to it for deep learning for an additional $1,000, and it has served me well.
However, as other team members started joining me on data science competitions and deep learning competitions got more popular, my team decided to build a more powerful desktop system.
The specifications of the new system that we built are as follows:
Total cost including tax and shipping was around $7,000. Depending on the budget, you can go down to 2 (-$1,520) 1080 Ti GPU cards instead of 4, or 64GB (-$399) instead of 128GB RAM, and still have a decent system.
You can find the full part lists here.
This year I had several occasions to give my “Winning Data Science Competitions” talk – at Microsoft, KSEA-SWC 2017, USC Applied Statistics Club, Spark SC, and Whisper.
I am grateful for all these opportunities to share what I enjoy with the data scientist community.
I truly believe that working on competitions on a regular basis can make us better data scientists. Hope my talk and slides help other data scientists.
My talk is outlined as follows:
You can find the latest slides here:
Originally posted at conversionlogic.com
There is significant debate in the data science community around the most important ingredients for attaining accurate results from predictive models. Some claim that it’s all about the quality and/or quantity of data, that you need a certain size data set (typically large) of a particular quality (typically very good) in order to get meaningful outputs. Others focus more on the models themselves, debating the merits of different single models – deep learning, gradient boosting machine, Gaussian process, etc. – versus a combined approach like the Ensemble Method.
Like the models they create, not all data scientists are created equal. I am less interested in who is “smarter” or has a better education, and more in how competitive and dedicated the modeler is. Most marketers don’t question the qualifications of a data science team because they expect that given good data and a solid algorithmic approach, they will achieve good predictive performance. At the very least, performance across different modelers should be comparable. Unfortunately, that’s not always the case.
In his New York Times bestseller Superforecasting, Prof. Philip Tetlock at the University of Pennsylvania showed that, at the Intelligence Advanced Research Projects Activity (IARPA) tournament, the performance of “superforecasters” was 50% better than standard, and 30% better than those with access to secret data. This clearly demonstrates that the people doing the modeling, not just the data or the models themselves, make a huge difference.
More relevant to predictive modeling specifically, KDD, one of the most prestigious data science conferences, has hosted an annual predictive modeling competition, KDD Cup, since 1997. It attracts participants from top universities, companies, and industries around the world. Although every team is given exactly the same data set, and is familiar with the same state-of-the-art algorithms, the resulting performances vary wildly across teams. Last year, the winning team achieved 91% accuracy while over 100 teams remained below 63% accuracy, 30% lower than the best score.
Both of these examples show the importance of not just the “how,” but the “who” when it comes to predictive modeling. This isn’t always the easiest thing for marketers to assess, but should definitely be taken into consideration when evaluating predictive analytics solutions. Ask about the data, and the models and methodology, but don’t forget the modelers themselves. The right data scientists can make all the difference to the success of your predictive program.
I’d like to open up the toolbox that I’ve built for data mining competitions, and share it with you.
Let me start with my setup.
I have access to 2 machines:
I purchased the desktop on eBay for around $2,000 a year ago (September 2014).
As the code repository and version control system, I use git.
It’s useful for collaboration with other team members. It makes it easy to share the code base, keep track of changes, and resolve conflicts when two people change the same code.
It’s useful even when I work by myself. It helps me reuse and improve the code from previous competitions.
I use S3 to share files between my machines. It is cheap – it costs me about $0.1 per month on average.
I use Dropbox to share files between team members.
For flow control or pipelining, I use makefiles (or GNU make).
It modularizes the long process of a data mining competition into feature extraction, single model training, and ensemble model training, and controls workflow between components.
For example, I have a top level makefile that defines the raw data file locations, folder hierarchies, and target variable.
# directories
DIR_DATA := data
DIR_BUILD := build
DIR_FEATURE := $(DIR_BUILD)/feature
DIR_VAL := $(DIR_BUILD)/val
DIR_TST := $(DIR_BUILD)/tst
...
DATA_TRN := $(DIR_DATA)/train.csv
DATA_TST := $(DIR_DATA)/test.csv
...
Y_TRN := $(DIR_DATA)/y.trn.yht
...
$(Y_TRN): $(DATA_TRN)
	cut -d, -f2 $< | tail -n +2 > $@
Then, I have makefiles for features that include the top level makefile, and define how to generate training and test feature files in various formats (CSV, libSVM, VW, libFFM, etc.).
include Makefile

FEATURE_NAME := feature3

FEATURE_TRN := $(DIR_FEATURE)/$(FEATURE_NAME).trn.sps
FEATURE_TST := $(DIR_FEATURE)/$(FEATURE_NAME).tst.sps
FEATURE_TRN_FFM := $(DIR_FEATURE)/$(FEATURE_NAME).trn.ffm
FEATURE_TST_FFM := $(DIR_FEATURE)/$(FEATURE_NAME).tst.ffm

$(FEATURE_TRN) $(FEATURE_TST): $(DATA_TRN) $(DATA_TST) | $(DIR_FEATURE)
	src/generate_feature3.py --train-file $< \
	                         --test-file $(lastword $^) \
	                         --train-feature-file $(FEATURE_TRN) \
	                         --test-feature-file $(FEATURE_TST)

%.ffm: %.sps
	src/svm_to_ffm.py --svm-file $< \
	                  --ffm-file $@ \
	                  --feature-name $(FEATURE_NAME)
...
Then, I have makefiles for single model training that include a feature makefile, and define how to train a single model and produce CV and test predictions.
include Makefile.feature.feature3

N = 400
DEPTH = 8
LRATE = 0.05
ALGO_NAME := xg_$(N)_$(DEPTH)_$(LRATE)
MODEL_NAME := $(ALGO_NAME)_$(FEATURE_NAME)
...
PREDICT_VAL := $(DIR_VAL)/$(MODEL_NAME).val.yht
PREDICT_TST := $(DIR_TST)/$(MODEL_NAME).tst.yht
SUBMISSION_TST := $(DIR_TST)/$(MODEL_NAME).sub.csv

all: validation submission
validation: $(METRIC_VAL)
submission: $(SUBMISSION_TST)
retrain: clean_$(ALGO_NAME) submission

$(PREDICT_TST) $(PREDICT_VAL): $(FEATURE_TRN) $(FEATURE_TST) \
                               | $(DIR_VAL) $(DIR_TST)
	./src/train_predict_xg.py --train-file $< \
	                          --test-file $(word 2, $^) \
	                          --predict-valid-file $(PREDICT_VAL) \
	                          --predict-test-file $(PREDICT_TST) \
	                          --depth $(DEPTH) \
	                          --lrate $(LRATE) \
	                          --n-est $(N)

$(SUBMISSION_TST): $(PREDICT_TST) $(ID_TST) | $(DIR_TST)
	paste -d, $(lastword $^) $< > $@
...
Then, I have makefiles for ensemble features that define which single model predictions are included in ensemble training.
include Makefile

FEATURE_NAME := esb9

BASE_MODELS := xg_600_4_0.05_feature9 \
               xg_400_4_0.05_feature6 \
               ffm_30_20_0.01_feature3 \
               ...

PREDICTS_TRN := $(foreach m, $(BASE_MODELS), $(DIR_VAL)/$(m).val.yht)
PREDICTS_TST := $(foreach m, $(BASE_MODELS), $(DIR_TST)/$(m).tst.yht)

FEATURE_TRN := $(DIR_FEATURE)/$(FEATURE_NAME).trn.csv
FEATURE_TST := $(DIR_FEATURE)/$(FEATURE_NAME).tst.csv

$(FEATURE_TRN): $(Y_TRN) $(PREDICTS_TRN) | $(DIR_FEATURE)
	paste -d, $^ > $@

$(FEATURE_TST): $(Y_TST) $(PREDICTS_TST) | $(DIR_FEATURE)
	paste -d, $^ > $@
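The paste -d, recipes above simply place the target and each base model's predictions side by side, one column per input file. A rough Python equivalent is sketched below. It is only an illustration, not part of my actual pipeline: the function name build_ensemble_features and the in-memory toy arrays are hypothetical stand-ins for the .yht files.

```python
import numpy as np


def build_ensemble_features(y, base_predictions):
    """Stack the target and base-model predictions column-wise.

    Mirrors `paste -d, y.yht model1.yht model2.yht ...`: column 0 is the
    target, and each remaining column holds one base model's predictions.
    """
    columns = [np.asarray(y, dtype=float)]
    columns += [np.asarray(p, dtype=float) for p in base_predictions]
    lengths = {len(c) for c in columns}
    if len(lengths) != 1:
        raise ValueError("all inputs must have the same number of rows")
    return np.column_stack(columns)


# Toy data: 4 rows and 2 hypothetical base models.
y = [0, 1, 1, 0]
pred_a = [0.2, 0.8, 0.6, 0.1]
pred_b = [0.3, 0.7, 0.9, 0.2]
features = build_ensemble_features(y, [pred_a, pred_b])
print(features.shape)  # (4, 3)
```

The resulting matrix is what the ensemble model is trained on: the base models' validation predictions become its input features.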
Finally, I can (re)produce the submission from the XGBoost ensemble with the 9 single models described in Makefile.feature.esb9 by (1) replacing include Makefile.feature.feature3 in Makefile.xg with include Makefile.feature.esb9 and (2) running:
$ make -f Makefile.xg
When I’m connected to the Internet, I always ssh to the desktop for its computational resources (mainly for RAM).
I followed Julian Simioni’s tutorial to allow remote SSH connections to the desktop. It needs an additional system with a publicly accessible IP address. You can set up an AWS micro (or free tier) EC2 instance for it.
tmux allows you to keep your SSH sessions alive even when you get disconnected. It also lets you split/add terminal screens in various ways and switch easily between them.
Documentation might look overwhelming, but all you need are:
# If there is no tmux session:
$ tmux
# If you created a tmux session, and want to connect to it:
$ tmux attach
Then to create a new pane/window and navigate in between:
Ctrl + b + " – to split the current window horizontally.
Ctrl + b + % – to split the current window vertically.
Ctrl + b + o – to move to the next pane in the current window.
Ctrl + b + c – to create a new window.
Ctrl + b + n – to move to the next window.
To close a pane/window, just type exit in the pane/window.
Hope this helps.
Next up: the machine learning tools I use.
Please share your setups and thoughts too. 🙂
You can upgrade Kaggler either by using
$ (sudo) pip install -U Kaggler
or from the source on GitHub:
$ git fetch origin
$ git rebase origin/master
$ python setup.py build_ext --inplace
$ (sudo) python setup.py install
I haven’t had a chance to use it with real competition data yet – after the Avazu competition, I deleted the whole build directory 🙁 – so I don’t have numbers yet for how much faster (or slower?!) it becomes after these changes.
I will jump into another competition soon, and let you know how it works. 🙂
This article was originally posted on Kaggle’s Avazu competition forum and reposted here with a few edits.
Here I’d like to share what I’ve put together for online learning as a Python package – named Kaggler.
You can install it with pip as follows:
$ pip install -U Kaggler
then, import algorithm classes as follows:
from kaggler.online_model import SGD, FTRL, FM, NN, NN_H2
Currently it supports 4 online learning algorithms – SGD, FTRL, FM, NN (1 or 2 ReLU hidden layers), and 1 batch learning algorithm – NN with L-BFGS AUC optimization.
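To show what "online learning" means here, the sketch below trains a logistic regression one example at a time with plain SGD. It is a minimal illustration of the idea only, not Kaggler's implementation or API: the class name OnlineLogisticSGD, its methods, and the toy data stream are all hypothetical.

```python
import math


class OnlineLogisticSGD:
    """Logistic regression trained one example at a time with plain SGD."""

    def __init__(self, n_features, learning_rate=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        """x is a list of (index, value) pairs for the non-zero features."""
        z = self.b + sum(self.w[i] * v for i, v in x)
        z = max(min(z, 35.0), -35.0)  # clip to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        """Take one gradient step on the log loss for a single example."""
        error = self.predict(x) - y
        self.b -= self.learning_rate * error
        for i, v in x:
            self.w[i] -= self.learning_rate * error * v


# Toy stream: feature 0 signals the positive class, feature 1 the negative.
stream = [([(0, 1.0)], 1), ([(1, 1.0)], 0)] * 200
model = OnlineLogisticSGD(n_features=2)
for x, y in stream:
    model.update(x, y)

print(model.predict([(0, 1.0)]) > 0.5)  # True
print(model.predict([(1, 1.0)]) < 0.5)  # True
```

Because each update touches only the non-zero features of one example, the whole data set never has to fit in memory, which is the main appeal of online learning for large, sparse click-through data.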
This article was originally posted on ethiel.org.
Recently, Prof. Konrad Koerding at Northwestern University asked on Facebook for advice for one of his Ph.D. students, who studies Computational Neuroscience but wants to pursue a career in Data Science. It reminded me of the time I was looking for such opportunities, so I shared my thoughts (now posted on the webpage of his lab here). I decided to post them here too (with a few fixes) so that they can help others.
First, I’d like to say that Data Science is a relatively new field (like Computational Neuroscience), and you don’t need to feel bad about making the transition after your Ph.D. When I was on the job market, I didn’t have any analytics background at all either.
I started my industry career at an analytics consulting company, Opera Solutions in San Diego, where one of Nicolas’ friends, Jacob, runs the R&D team. Jacob also did his Ph.D. in Computational Neuroscience, under the supervision of Prof. Michael Arbib at the University of Southern California. During the interview, I was tested on my thought process, basic knowledge of statistics and Machine Learning, and programming, all of which I had practiced every day throughout my Ph.D.
So, if he has a good Machine Learning background with programming skills (I’m sure that he does, based on the fact he’s your student), he can be competent to pursue his career in Data Science.
Tools in Data Science
R is similar to MATLAB except that it’s free. It is not a hardcore programming language and doesn’t take much time to learn. It comes with the latest statistical libraries and provides powerful plotting functions. There are many IDEs that make R easy to use, but my favorite is RStudio. If you run R on a server with RStudio Server, you can access it from anywhere via your web browser, which is really cool. Although the native R plotting functions are excellent by themselves, the ggplot2 library provides more eye-catching visualizations.
For Python, the NumPy + SciPy packages provide vector-matrix computation functionality similar to MATLAB’s. For Machine Learning algorithms, you need scikit-learn, and for data handling, pandas will make your life easy. For debugging and prototyping, IPython Notebook is really handy.
SQL is an old technology but still widely used. Most data is stored in data warehouses, which can be accessed only via SQL or SQL equivalents (Oracle, Teradata, Netezza, etc.). Postgres and MySQL are powerful yet free, so they are perfect to practice with.
Hints for Kaggle Data Mining Competitions
Fortunately, I had a chance to work with many top competitors, such as the 1st and 2nd place teams in the Netflix competition, and learn how they approach competitions. Here are some tips I found helpful.
1. Don’t jump into algorithms too fast.
Spend enough time to understand the data. Algorithms are important, but no matter how good an algorithm you use, garbage in only leads to garbage out. Many classification/regression algorithms assume Gaussian-distributed variables, and fail to make good predictions if you provide non-Gaussian variables. So standardization, normalization, non-linear transformation, discretization, and binning are very important.
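As a rough illustration of the transformations mentioned above, here is a minimal NumPy sketch of a log transform, standardization, and quantile binning. The function names and the toy data are hypothetical, and the right transformation always depends on the variable at hand.

```python
import numpy as np


def standardize(x):
    """Rescale to zero mean and unit variance."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()


def log_transform(x):
    """Compress a right-skewed, non-negative variable with log(1 + x)."""
    return np.log1p(np.asarray(x, dtype=float))


def quantile_bin(x, n_bins=4):
    """Assign each value to one of n_bins roughly equal-sized bins."""
    x = np.asarray(x, dtype=float)
    # Interior quantile cut points, e.g. 25%, 50%, 75% for 4 bins.
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(x, edges)


# Toy right-skewed variable (hypothetical data, e.g. purchase amounts).
raw = np.array([1.0, 2.0, 3.0, 5.0, 8.0, 13.0, 200.0, 1000.0])
scaled = standardize(log_transform(raw))
bins = quantile_bin(raw, n_bins=4)
print(bins)  # [0 0 1 1 2 2 3 3]
```

Taking the log before standardizing keeps the two large values from dominating the scale, while binning discards the magnitude entirely and keeps only the rank group.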
2. Try different algorithms and blend.
There is no universally optimal algorithm. Most of the time (if not always), the winning solutions are ensembles of many individual models built with tens of different algorithms. Combining different kinds of models can improve prediction performance a lot. For individual models, I found Random Forest, Gradient Boosting Machine, Factorization Machine, Neural Network, Support Vector Machine, logistic/linear regression, Naive Bayes, and collaborative filtering to be the most useful. Gradient Boosting Machine and Factorization Machine are often the best individual models.
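The simplest form of blending is a (weighted) average of the individual models' predictions. The sketch below assumes two hypothetical models and uses toy data constructed so that their errors point in opposite directions. Real gains depend on how correlated the models' errors are, and winning ensembles are usually more elaborate than a plain average.

```python
import numpy as np


def blend(predictions, weights=None):
    """Weighted average of per-model prediction vectors."""
    predictions = np.asarray(predictions, dtype=float)  # (n_models, n_rows)
    if weights is None:
        weights = np.ones(len(predictions))
    weights = np.asarray(weights, dtype=float)
    return weights @ predictions / weights.sum()


def rmse(y_true, y_pred):
    """Root mean squared error."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


# Toy example: two hypothetical models with opposite-signed errors.
y = np.array([1.0, 2.0, 3.0, 4.0])
model_a = y + np.array([0.5, -0.5, 0.5, -0.5])
model_b = y + np.array([-0.3, 0.3, -0.3, 0.3])
blended = blend([model_a, model_b])

print(rmse(y, model_a), rmse(y, model_b), rmse(y, blended))
```

In this constructed case the blend has a lower RMSE than either model alone, because part of each model's error cancels out in the average.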
3. Optimize at last.
Each competition has a different evaluation metric, and optimizing your algorithms for that metric can improve your chance of winning. The two most popular metrics are RMSE and AUC (area under the ROC curve). An algorithm that optimizes one metric is not necessarily optimal for the other. Many open source algorithm implementations provide only RMSE optimization, so for AUC (or other metric) optimization, you need to implement it yourself.
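To see why the two metrics can disagree, note that AUC depends only on how the predictions are ranked, while RMSE depends on their actual values. The sketch below uses toy labels and scores (hypothetical data) and a deliberately simple pairwise AUC that is meant for illustration, not for large data sets.

```python
import numpy as np


def rmse(y_true, y_score):
    """Root mean squared error between labels and scores."""
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_score) ** 2)))


def auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. This O(n_pos * n_neg) version is for illustration.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


y = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
squashed = scores / 10  # same ranking, very different scale

print(auc(y, scores), auc(y, squashed))    # 0.75 0.75
print(rmse(y, scores), rmse(y, squashed))  # RMSE gets worse after squashing
```

Rescaling the scores leaves AUC unchanged but makes RMSE much worse, so a model tuned for one metric can look quite different under the other.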