Kaggler. Data Scientist.
In 2014, I shared the specifications of a 6-core, 64GB RAM desktop system that I purchased for around $2,000. Since then, I added an NVIDIA Titan X to it for deep learning for an additional $1,000, and it has served me well.
However, as other team members started joining me on data science competitions and deep learning competitions got more popular, my team decided to build a more powerful desktop system.
The specifications of the new system that we built are as follows:
Total cost including tax and shipping was around $7,000. Depending on your budget, you can drop to two 1080 Ti GPU cards instead of four (-$1,520), or to 64GB of RAM instead of 128GB (-$399), and still have a decent system.
You can find the full part lists here.
This year I had several occasions to give my “Winning Data Science Competitions” talk – at Microsoft, KSEA-SWC 2017, USC Applied Statistics Club, Spark SC, and Whisper.
I am grateful for all these opportunities to share what I enjoy with the data scientist community.
I truly believe that working on competitions on a regular basis can make us better data scientists. Hope my talk and slides help other data scientists.
My talk is outlined as follows:
You can find the latest slides here:
Congrats to the winners and top performers, and thanks to all contributors in the forum for the great sharing. It’s always a humbling experience to compete at Kaggle. I learn so much at every competition from a lot of fellow kagglers.
Here I’d like to share my code base and notes for the competition:
My friends and I have been using the framework based on Makefiles for competitions for years now and it has worked great so far.
An introduction to the framework is available on the TalkingData forum:
Our code repos for past competitions are also available at:
Hope it’s helpful.
Our code is available here: https://gitlab.com/mbay/bosch
and internal LB is available here: https://gitlab.com/mbay/bosch/wikis/home
Hi, I’d like to share my team, ensemble’s solution and framework.
The code is available at gitlab:
and team’s internal LB is available here:
We joined the competition late, and had just enough time to build and run the end-to-end framework without much feature engineering. So feature-wise, there is nothing fancy, but I hope that you can find the framework itself helpful. 🙂
As you can see, it uses Makefiles to pipeline feature generation, single model training, and ensemble training. The main benefits of our framework based on Makefiles are:
If you are new to Makefiles, here are some references:
I’d like to open up my toolbox that I’ve built for data mining competitions, and share with you.
Let me start with my setup.
I have access to 2 machines:
I purchased the desktop from eBay for around $2,000 a year ago (September 2014).
As the code repository and version control system, I use git.
It’s useful for collaboration with other team members. It makes it easy to share the code base, keep track of changes, and resolve conflicts when two people change the same code.
It’s useful even when I work by myself. It helps me reuse and improve the code from previous competitions.
I use S3 to share files between my machines. It is cheap – it costs me about $0.1 per month on average.
I use Dropbox to share files between team members.
For flow control or pipelining, I use makefiles (or GNU make).
It modularizes the long process of a data mining competition into feature extraction, single model training, and ensemble model training, and controls workflow between components.
For example, I have a top level makefile that defines the raw data file locations, folder hierarchies, and target variable.
# directories
DIR_DATA := data
DIR_BUILD := build
DIR_FEATURE := $(DIR_BUILD)/feature
DIR_VAL := $(DIR_BUILD)/val
DIR_TST := $(DIR_BUILD)/tst
...
DATA_TRN := $(DIR_DATA)/train.csv
DATA_TST := $(DIR_DATA)/test.csv
...
Y_TRN := $(DIR_DATA)/y.trn.yht
...
$(Y_TRN): $(DATA_TRN)
	cut -d, -f2 $< | tail -n +2 > $@
Then, I have makefiles for features that include the top level makefile and define how to generate training and test feature files in various formats (CSV, libSVM, VW, libFFM, etc.).
include Makefile

FEATURE_NAME := feature3

FEATURE_TRN := $(DIR_FEATURE)/$(FEATURE_NAME).trn.sps
FEATURE_TST := $(DIR_FEATURE)/$(FEATURE_NAME).tst.sps
FEATURE_TRN_FFM := $(DIR_FEATURE)/$(FEATURE_NAME).trn.ffm
FEATURE_TST_FFM := $(DIR_FEATURE)/$(FEATURE_NAME).tst.ffm

$(FEATURE_TRN) $(FEATURE_TST): $(DATA_TRN) $(DATA_TST) | $(DIR_FEATURE)
	src/generate_feature3.py --train-file $< \
	                         --test-file $(lastword $^) \
	                         --train-feature-file $(FEATURE_TRN) \
	                         --test-feature-file $(FEATURE_TST)

%.ffm: %.sps
	src/svm_to_ffm.py --svm-file $< \
	                  --ffm-file $@ \
	                  --feature-name $(FEATURE_NAME)
...
Then, I have makefiles for single model training that include a feature makefile and define how to train a single model and produce CV and test predictions.
include Makefile.feature.feature3

N = 400
DEPTH = 8
LRATE = 0.05
ALGO_NAME := xg_$(N)_$(DEPTH)_$(LRATE)
MODEL_NAME := $(ALGO_NAME)_$(FEATURE_NAME)
...
PREDICT_VAL := $(DIR_VAL)/$(MODEL_NAME).val.yht
PREDICT_TST := $(DIR_TST)/$(MODEL_NAME).tst.yht
SUBMISSION_TST := $(DIR_TST)/$(MODEL_NAME).sub.csv

all: validation submission
validation: $(METRIC_VAL)
submission: $(SUBMISSION_TST)
retrain: clean_$(ALGO_NAME) submission

$(PREDICT_TST) $(PREDICT_VAL): $(FEATURE_TRN) $(FEATURE_TST) \
                                   | $(DIR_VAL) $(DIR_TST)
	./src/train_predict_xg.py --train-file $< \
	                          --test-file $(word 2, $^) \
	                          --predict-valid-file $(PREDICT_VAL) \
	                          --predict-test-file $(PREDICT_TST) \
	                          --depth $(DEPTH) \
	                          --lrate $(LRATE) \
	                          --n-est $(N)

$(SUBMISSION_TST): $(PREDICT_TST) $(ID_TST) | $(DIR_TST)
	paste -d, $(lastword $^) $< > $@
...
Then, I have makefiles for ensemble features that define which single model predictions are included for ensemble training.
include Makefile

FEATURE_NAME := esb9

BASE_MODELS := xg_600_4_0.05_feature9 \
               xg_400_4_0.05_feature6 \
               ffm_30_20_0.01_feature3 \
               ...

PREDICTS_TRN := $(foreach m, $(BASE_MODELS), $(DIR_VAL)/$(m).val.yht)
PREDICTS_TST := $(foreach m, $(BASE_MODELS), $(DIR_TST)/$(m).tst.yht)

FEATURE_TRN := $(DIR_FEATURE)/$(FEATURE_NAME).trn.csv
FEATURE_TST := $(DIR_FEATURE)/$(FEATURE_NAME).tst.csv

$(FEATURE_TRN): $(Y_TRN) $(PREDICTS_TRN) | $(DIR_FEATURE)
	paste -d, $^ > $@

$(FEATURE_TST): $(Y_TST) $(PREDICTS_TST) | $(DIR_FEATURE)
	paste -d, $^ > $@
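For readers less familiar with paste, here is a rough Python sketch of what the two ensemble rules do: they place the target file and each base model's prediction file side by side, one column per file. The function name paste_columns is hypothetical and not part of the framework, and the row-count check is an extra safeguard that paste itself does not perform.

```python
import csv
from pathlib import Path


def paste_columns(column_files, out_file):
    """Join one-value-per-line files side by side into a CSV, like `paste -d,`."""
    if not column_files:
        raise ValueError("at least one input file is required")
    columns = [Path(f).read_text().splitlines() for f in column_files]
    n_rows = len(columns[0])
    # Unlike paste, stop early if the files disagree on the number of rows.
    for f, col in zip(column_files, columns):
        if len(col) != n_rows:
            raise ValueError(f"{f} has {len(col)} rows, expected {n_rows}")
    with open(out_file, "w", newline="") as fh:
        # Each output row is: target, prediction of model 1, model 2, ...
        csv.writer(fh, lineterminator="\n").writerows(zip(*columns))
    return n_rows
```

Checking row counts is worth the extra lines here, because a base model that was trained on a stale feature file would otherwise be silently misaligned with the target column.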
Finally, I can (re)produce the submission from the XGBoost ensemble with 9 single models described in Makefile.feature.esb9 by (1) replacing include Makefile.feature.feature3 in Makefile.xg with include Makefile.feature.esb9 and (2) running:
$ make -f Makefile.xg
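If you are wondering why make only reruns the steps that changed, the core rule is simple: a target is rebuilt when it is missing or when any of its normal prerequisites is newer than it. Here is a minimal Python sketch of that check. The function name needs_rebuild is hypothetical, and real make does much more (pattern rules, recursion through the dependency graph, and so on).

```python
import os


def needs_rebuild(target, prerequisites):
    """Return True if make would consider `target` out of date."""
    if not os.path.exists(target):
        return True  # a missing target is always rebuilt
    target_mtime = os.path.getmtime(target)
    # Rebuild if any normal prerequisite was modified after the target.
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)
```

Order-only prerequisites, the ones listed after | in the rules above (such as the build directories), are deliberately left out of this comparison: make only requires that they exist, so touching a directory does not trigger retraining.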
When I’m connected to the Internet, I always ssh to the desktop for its computational resources (mainly for RAM).
I followed Julian Simioni’s tutorial to allow remote SSH connection to the desktop. It needs an additional system with a publicly accessible IP address. You can set up an AWS micro (or free tier) EC2 instance for it.
tmux allows you to keep your SSH sessions even when you get disconnected. It also lets you split/add terminal screens in various ways and switch easily between those.
Documentation might look overwhelming, but all you need are:
# If there is no tmux session:
$ tmux
# If you created a tmux session, and want to connect to it:
$ tmux attach
Then to create a new pane/window and navigate in between:
Ctrl + b + " – to split the current window horizontally (top and bottom panes).
Ctrl + b + % – to split the current window vertically (left and right panes).
Ctrl + b + o – to move to next pane in the current window.
Ctrl + b + c – to create a new window.
Ctrl + b + n – to move to next window.
To close a pane/window, just type exit in the pane/window.
Hope this helps.
Next up: the machine learning tools I use.
Please share your setups and thoughts too. 🙂
UPDATE on 9/15/2015
I found a bug in OneHotEncoder and fixed it. The fix is not available on pip yet, but you can update Kaggler to the latest version from the source as follows:
$ git clone https://github.com/jeongyoonlee/Kaggler.git
$ cd Kaggler
$ python setup.py build_ext --inplace
$ sudo python setup.py install
If you find a bug, please submit a pull request on GitHub or comment here.
I’m glad to announce the release of Kaggler 0.4.0.
Kaggler is a Python package that provides utility functions and online learning algorithms for classification. I use it for Kaggle competitions along with scikit-learn, Lasagne, XGBoost, and Vowpal Wabbit.
Classes in kaggler.preprocessing now support fit_transform and transform methods. Currently 2 preprocessing classes are available as follows:
Normalizer – aligns distributions of numerical features into a normal distribution. Note that it’s different from sklearn.preprocessing.Normalizer, which only scales features without changing distributions.
OneHotEncoder – transforms categorical features into dummy variables. It is similar to sklearn.preprocessing.OneHotEncoder except that it groups infrequent values into a dummy variable.
from kaggler.preprocessing import OneHotEncoder

# values appearing less than min_obs are grouped into one dummy variable.
enc = OneHotEncoder(min_obs=10, nan_as_var=False)
X_train = enc.fit_transform(train)
X_test = enc.transform(test)
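To make the two ideas concrete, here is a small standard-library sketch. It is not Kaggler's implementation: rank_gauss is one common way to push a numerical feature toward a normal shape (rank the values, then map ranks to normal quantiles), and group_rare shows the "infrequent values share one dummy variable" idea. All function names and the "__other__" label are made up for illustration.

```python
from collections import Counter
from statistics import NormalDist


def rank_gauss(values):
    """Map values to standard-normal quantiles based on their ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    std_normal = NormalDist()
    for rank, i in enumerate(order):
        # (rank + 0.5) / n stays strictly inside (0, 1), so inv_cdf is finite.
        # Note: tied values get distinct ranks in this simple version.
        out[i] = std_normal.inv_cdf((rank + 0.5) / n)
    return out


def group_rare(values, min_obs=10, other="__other__"):
    """Replace values seen fewer than min_obs times with a shared label."""
    counts = Counter(values)
    return [v if counts[v] >= min_obs else other for v in values]


def one_hot(values):
    """Return the sorted categories and one 0/1 row per input value."""
    categories = sorted(set(values))
    index = {c: j for j, c in enumerate(categories)}
    rows = [[1 if index[v] == j else 0 for j in range(len(categories))]
            for v in values]
    return categories, rows
```

In a real pipeline the counts and categories would be learned on the training data only and then reused for the test data, which is what the fit_transform and transform split is for.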
3 metrics are available as follows:
logloss – calculates the bounded log loss error for classification predictions.
rmse – calculates the root mean squared error for regression predictions.
gini – calculates the gini coefficient for regression predictions.
from kaggler.metrics import gini

score = gini(y, p)
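For reference, here is a plain-Python sketch of the first two metrics. It assumes that "bounded" means predictions are clipped away from 0 and 1 before taking logs. The eps value is my assumption, and these are not the Kaggler functions themselves. The gini metric is left out of this sketch.

```python
import math


def logloss(y, p, eps=1e-15):
    """Mean binary log loss with predictions clipped to [eps, 1 - eps]."""
    total = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1.0 - eps)  # avoid log(0) on confident misses
        total -= yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
    return total / len(y)


def rmse(y, p):
    """Root mean squared error between targets y and predictions p."""
    return math.sqrt(sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y))
```

Without the clipping, a single prediction of exactly 0 or 1 on the wrong side would make the log loss infinite.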
ClassificationTree) now support
predict methods. Currently 5 online learning algorithms are available as follows:
SGD – stochastic gradient descent algorithm with hashing trick and interaction
FTRL – follow-the-regularized-leader algorithm with hashing trick and interaction
FM – factorization machine algorithm
NN (or NN_H2) – neural network algorithm with a single (or double) hidden layer(s)
ClassificationTree – decision tree algorithm
from kaggler.online_model import FTRL
from kaggler.data_io import load_data

# load a libsvm format sparse feature file
X, y = load_data('train.sparse', dense=False)

# FTRL
clf = FTRL(a=.1,              # alpha in the per-coordinate rate
           b=1,               # beta in the per-coordinate rate
           l1=1.,             # L1 regularization parameter
           l2=1.,             # L2 regularization parameter
           n=2**20,           # number of hashed features
           epoch=1,           # number of epochs
           interaction=True)  # use feature interaction or not

# training and prediction
clf.fit(X, y)
p = clf.predict(X)
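To show what the a, b, l1, l2 and n parameters control, here is a minimal pure-Python sketch of the standard FTRL-proximal update for logistic regression with the hashing trick. It is not Kaggler's code: the class name TinyFTRL, the use of zlib.crc32 as the hash, and the restriction to binary (present/absent) features are all my assumptions, and feature interaction and multiple epochs are omitted.

```python
import math
import zlib


class TinyFTRL:
    """Minimal FTRL-proximal logistic regression over hashed binary features."""

    def __init__(self, a=0.1, b=1.0, l1=1.0, l2=1.0, n=2 ** 20):
        self.a, self.b, self.l1, self.l2, self.n = a, b, l1, l2, n
        self.z = {}   # per-coordinate adjusted gradient sums
        self.g2 = {}  # per-coordinate sums of squared gradients

    def _indices(self, features):
        # Hashing trick: map each feature string to one of n slots.
        return sorted({zlib.crc32(f.encode("utf-8")) % self.n for f in features})

    def _weight(self, i):
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0  # L1 keeps weakly supported coordinates exactly at zero
        sign = 1.0 if z > 0 else -1.0
        scale = (self.b + math.sqrt(self.g2.get(i, 0.0))) / self.a + self.l2
        return -(z - sign * self.l1) / scale

    @staticmethod
    def _sigmoid(wx):
        wx = max(min(wx, 35.0), -35.0)  # keep exp() in a safe range
        return 1.0 / (1.0 + math.exp(-wx))

    def predict_one(self, features):
        return self._sigmoid(sum(self._weight(i) for i in self._indices(features)))

    def update(self, features, y):
        """Predict for one example, then apply one FTRL-proximal step."""
        idx = self._indices(features)
        weights = {i: self._weight(i) for i in idx}
        p = self._sigmoid(sum(weights.values()))
        g = p - y  # log loss gradient for every active binary feature
        for i in idx:
            n_i = self.g2.get(i, 0.0)
            sigma = (math.sqrt(n_i + g * g) - math.sqrt(n_i)) / self.a
            self.z[i] = self.z.get(i, 0.0) + g - sigma * weights[i]
            self.g2[i] = n_i + g * g
        return p
```

Weights are never stored directly: each one is recomputed from z and the squared-gradient sum, which is what lets L1 zero out rarely seen features and keeps the model small even with n = 2**20 hashed slots.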
Please let me know if you have any comments or want to contribute. 🙂
You can upgrade Kaggler either by using pip:
$ (sudo) pip install -U Kaggler
or from the source at github:
$ git fetch origin
$ git rebase origin/master
$ python setup.py build_ext --inplace
$ (sudo) python setup.py install
I haven’t had a chance to use it with real competition data yet – after the Avazu competition, I deleted the whole build directory 🙁 – so I don’t have numbers yet for how much faster (or slower?!) it becomes after these changes.
I will jump into another competition soon, and let you know how it works. 🙂
This article was originally posted on Kaggle’s Avazu competition forum and reposted here with a few edits.
Here I’d like to share what I’ve put together for online learning as a Python package – named Kaggler.
You can install it with pip as follows:
$ pip install -U Kaggler
Then, import the algorithm classes as follows:
from kaggler.online_model import SGD, FTRL, FM, NN, NN_H2
Currently it supports 4 online learning algorithms – SGD, FTRL, FM, NN (1 or 2 ReLU hidden layers), and 1 batch learning algorithm – NN with L-BFGS AUC optimization.