Kaggler’s Toolbox – Setup

I’d like to open up the toolbox I’ve built for data mining competitions and share it with you.

Let me start with my setup.

System

I have access to 2 machines:

  • Laptop – MacBook Pro Retina 15″, OS X Yosemite, i7 2.3GHz 4 Core CPU, 16GB RAM, GeForce GT 750M 2GB, 500GB SSD
  • Desktop – Ubuntu 14.04, i7 5820K 3.3GHz 6 Core CPU, 64GB RAM, GeForce GT 620 1GB, 120GB SSD + 3TB HDD

I purchased the desktop on eBay for around $2,000 a year ago (September 2014).

Git

As the code repository and version control system, I use git.

It’s useful for collaborating with team members.  It makes it easy to share the code base, keep track of changes, and resolve conflicts when two people change the same code.

It’s useful even when I work by myself.  It helps me reuse and improve code from previous competitions.

For competitions, I use GitLab instead of GitHub because it offers an unlimited number of private repositories.

S3 / Dropbox

I use S3 to share files between my machines.  It is cheap – it costs me about $0.10 per month on average.

To access S3, I use the AWS CLI.  I also used s3cmd in the past and liked it.

I use Dropbox to share files with team members.

Makefile

For flow control or pipelining, I use makefiles (or GNU make).

It modularizes the long process of a data mining competition into feature extraction, single model training, and ensemble model training, and controls workflow between components.

For example, I have a top-level makefile that defines the raw data file locations, folder hierarchies, and target variable.

# directories
DIR_DATA := data
DIR_BUILD := build
DIR_FEATURE := $(DIR_BUILD)/feature
DIR_VAL := $(DIR_BUILD)/val
DIR_TST := $(DIR_BUILD)/tst
...
DATA_TRN := $(DIR_DATA)/train.csv
DATA_TST := $(DIR_DATA)/test.csv
...
Y_TRN := $(DIR_DATA)/y.trn.yht
...
$(Y_TRN): $(DATA_TRN)
	cut -d, -f2 $< | tail -n +2 > $@

Then, I have makefiles for features that include the top-level makefile and define how to generate training and test feature files in various formats (CSV, libSVM, VW, libFFM, etc.).

include Makefile

FEATURE_NAME := feature3

FEATURE_TRN := $(DIR_FEATURE)/$(FEATURE_NAME).trn.sps
FEATURE_TST := $(DIR_FEATURE)/$(FEATURE_NAME).tst.sps

FEATURE_TRN_FFM := $(DIR_FEATURE)/$(FEATURE_NAME).trn.ffm
FEATURE_TST_FFM := $(DIR_FEATURE)/$(FEATURE_NAME).tst.ffm

$(FEATURE_TRN) $(FEATURE_TST): $(DATA_TRN) $(DATA_TST) | $(DIR_FEATURE)
	src/generate_feature3.py --train-file $< \
                                 --test-file $(lastword $^) \
                                 --train-feature-file $(FEATURE_TRN) \
                                 --test-feature-file $(FEATURE_TST)
%.ffm: %.sps
	src/svm_to_ffm.py --svm-file $< \
                          --ffm-file $@ \
                          --feature-name $(FEATURE_NAME)
...
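
As a rough illustration of what the %.ffm rule does, here is a minimal Python sketch of a libSVM-to-libFFM line conversion.  It is not the actual src/svm_to_ffm.py, which isn’t shown here.  libFFM expects field:index:value triples, so the sketch needs some way to assign a field to each feature index; the field_of argument is a hypothetical hook for that, and its default (every feature is its own field) is only an assumption for illustration.

```python
def svm_line_to_ffm(line, field_of=lambda idx: idx):
    """Convert one libSVM line ('label idx:val ...') to libFFM
    ('label field:idx:val ...').

    field_of is a hypothetical callable mapping a feature index to a
    field id; by default each feature is treated as its own field.
    """
    tokens = line.split()
    if not tokens:
        return ""
    label, pairs = tokens[0], tokens[1:]
    out = [label]
    for pair in pairs:
        idx, val = pair.split(":", 1)
        out.append("{}:{}:{}".format(field_of(int(idx)), idx, val))
    return " ".join(out)
```

For a real data set, field_of would typically map all dummy variables derived from the same original column to one field.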

Then, I have makefiles for single model training that include a feature makefile and define how to train a single model and produce CV and test predictions.

include Makefile.feature.feature3

N = 400
DEPTH = 8
LRATE = 0.05
ALGO_NAME := xg_$(N)_$(DEPTH)_$(LRATE)
MODEL_NAME := $(ALGO_NAME)_$(FEATURE_NAME)
...
PREDICT_VAL := $(DIR_VAL)/$(MODEL_NAME).val.yht
PREDICT_TST := $(DIR_TST)/$(MODEL_NAME).tst.yht
SUBMISSION_TST := $(DIR_TST)/$(MODEL_NAME).sub.csv

all: validation submission
validation: $(METRIC_VAL)
submission: $(SUBMISSION_TST)
retrain: clean_$(ALGO_NAME) submission

$(PREDICT_TST) $(PREDICT_VAL): $(FEATURE_TRN) $(FEATURE_TST) \
                                   | $(DIR_VAL) $(DIR_TST)
	./src/train_predict_xg.py --train-file $< \
                                  --test-file $(word 2, $^) \
                                  --predict-valid-file $(PREDICT_VAL) \
                                  --predict-test-file $(PREDICT_TST) \
                                  --depth $(DEPTH) \
                                  --lrate $(LRATE) \
                                  --n-est $(N)

$(SUBMISSION_TST): $(PREDICT_TST) $(ID_TST) | $(DIR_TST)
	paste -d, $(lastword $^) $< > $@
...

Then, I have makefiles for ensemble features that define which single model predictions are included in ensemble training.

include Makefile

FEATURE_NAME := esb9

BASE_MODELS := xg_600_4_0.05_feature9 \
               xg_400_4_0.05_feature6 \
               ffm_30_20_0.01_feature3 \
               ...

PREDICTS_TRN := $(foreach m, $(BASE_MODELS), $(DIR_VAL)/$(m).val.yht)
PREDICTS_TST := $(foreach m, $(BASE_MODELS), $(DIR_TST)/$(m).tst.yht)

FEATURE_TRN := $(DIR_FEATURE)/$(FEATURE_NAME).trn.csv
FEATURE_TST := $(DIR_FEATURE)/$(FEATURE_NAME).tst.csv

$(FEATURE_TRN): $(Y_TRN) $(PREDICTS_TRN) | $(DIR_FEATURE)
	paste -d, $^ > $@

$(FEATURE_TST): $(Y_TST) $(PREDICTS_TST) | $(DIR_FEATURE)
	paste -d, $^ > $@
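
The two paste recipes above build the ensemble feature files by putting the target in the first column and one column per base model prediction after it.  A minimal Python sketch of the same column-wise join follows.  The function name is hypothetical, and unlike paste it raises an error when the inputs have different numbers of rows.

```python
def paste_columns(columns, delimiter=","):
    """Join equal-length columns of strings row by row, like `paste -d,`.

    columns[0] would be the target values and the rest the predictions
    of each base model, in the order listed in BASE_MODELS.
    """
    lengths = {len(col) for col in columns}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same number of rows")
    return [delimiter.join(row) for row in zip(*columns)]
```

The length check is worth having in practice: a prediction file with a missing row would otherwise silently misalign the ensemble features.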

Finally, I can (re)produce the submission from an XGBoost ensemble of the 9 single models described in Makefile.feature.esb9 by (1) replacing include Makefile.feature.feature3 in Makefile.xg with include Makefile.feature.esb9 and (2) running:

$ make -f Makefile.xg

SSH Tunneling

When I’m connected to the Internet, I always SSH into the desktop for its computational resources (mainly RAM).

I followed Julian Simioni’s tutorial to allow remote SSH connections to the desktop.  It needs an additional system with a publicly accessible IP address.  You can set up an AWS micro (or free tier) EC2 instance for it.

tmux

tmux allows you to keep your SSH sessions even when you get disconnected.  It also lets you split/add terminal screens in various ways and switch easily between them.

The documentation might look overwhelming, but all you need is:

# If there is no tmux session:
$ tmux

or

# If you created a tmux session, and want to connect to it:
$ tmux attach

Then, to create a new pane/window and navigate between them, press Ctrl + b, release it, and press one of the following keys:

  • Ctrl + b, then " – to split the current pane into top and bottom panes.
  • Ctrl + b, then % – to split the current pane into left and right panes.
  • Ctrl + b, then o – to move to the next pane in the current window.
  • Ctrl + b, then c – to create a new window.
  • Ctrl + b, then n – to move to the next window.

To close a pane/window, just type exit in the pane/window.

 

Hope this helps.

Next up: the machine learning tools I use.

Please share your setups and thoughts too. 🙂

Kaggler. Data Scientist.

Kaggler 0.4.0 Released

UPDATE on 9/15/2015

I found a bug in OneHotEncoder, and fixed it.  The fix is not available on pip yet, but you can update Kaggler to the latest version from the source as follows:

$ git clone https://github.com/jeongyoonlee/Kaggler.git
$ cd Kaggler
$ python setup.py build_ext --inplace
$ sudo python setup.py install

If you find a bug, please submit a pull request on GitHub or comment here.


I’m glad to announce the release of Kaggler 0.4.0.

Kaggler is a Python package that provides utility functions and online learning algorithms for classification.  I use it for Kaggle competitions along with scikit-learn, Lasagne, XGBoost, and Vowpal Wabbit.

Kaggler 0.4.0 adds a scikit-learn-like interface for preprocessing, metrics, and online learning algorithms.

kaggler.preprocessing

Classes in kaggler.preprocessing now support fit, fit_transform, and transform methods.  Currently, 2 preprocessing classes are available:

  • Normalizer – aligns distributions of numerical features into a normal distribution.  Note that it’s different from sklearn.preprocessing.Normalizer, which rescales each sample to unit norm and does not reshape feature distributions.
  • OneHotEncoder – transforms categorical features into dummy variables.  It is similar to sklearn.preprocessing.OneHotEncoder except that it groups infrequent values into a single dummy variable.

from kaggler.preprocessing import OneHotEncoder

# values appearing fewer than min_obs times are grouped into one dummy variable.
enc = OneHotEncoder(min_obs=10, nan_as_var=False)
X_train = enc.fit_transform(train)
X_test = enc.transform(test)
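
To make the grouping idea concrete, here is a minimal pure-Python sketch for a single categorical column.  It is not Kaggler’s implementation: the function names are hypothetical, and sending values unseen during fitting to the shared “infrequent” column is an assumption of this sketch.

```python
from collections import Counter


def fit_label_map(values, min_obs=10):
    """Map each value seen at least min_obs times to its own column.

    Column 0 is reserved for the shared bucket of infrequent values.
    """
    counts = Counter(values)
    frequent = sorted(v for v, c in counts.items() if c >= min_obs)
    return {v: i + 1 for i, v in enumerate(frequent)}


def one_hot(values, label_map):
    """Encode values as dummy variables using a fitted label map.

    Infrequent values (and, by assumption, unseen ones) go to column 0.
    """
    n_cols = len(label_map) + 1
    rows = []
    for v in values:
        row = [0] * n_cols
        row[label_map.get(v, 0)] = 1
        rows.append(row)
    return rows
```

Grouping rare values this way keeps the number of dummy variables bounded when a column has a long tail of categories that appear only a few times.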

kaggler.metrics

3 metrics are available as follows:

  • logloss – calculates the bounded log loss error for classification predictions.
  • rmse – calculates the root mean squared error for regression predictions.
  • gini – calculates the Gini coefficient for regression predictions.

from kaggler.metrics import gini

score = gini(y, p)
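
For reference, the first two metrics can be sketched in a few lines of plain Python.  This is not Kaggler’s code: “bounded” is read here as clipping predictions away from 0 and 1 so that the logarithm stays finite, and the eps value is an assumption.  gini is left out because its exact definition in Kaggler is not given in this post.

```python
import math


def logloss(y, p, eps=1e-15):
    """Mean log loss with predictions clipped to [eps, 1 - eps]."""
    total = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1.0 - eps)  # bound the prediction
        total -= yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
    return total / len(y)


def rmse(y, p):
    """Root mean squared error."""
    return math.sqrt(sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y))
```

Without the clipping, a single confident but wrong prediction of exactly 0 or 1 would make the log loss infinite.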

kaggler.online_model

Classes in kaggler.online_model (except ClassificationTree) now support fit and predict methods.  Currently, 5 online learning algorithms are available:

  • SGD – stochastic gradient descent algorithm with hashing trick and interaction
  • FTRL – follow-the-regularized-leader algorithm with hashing trick and interaction
  • FM – factorization machine algorithm
  • NN (or NN_H2) – neural network algorithm with a single (or double) hidden layer(s)
  • ClassificationTree – decision tree algorithm

from kaggler.online_model import FTRL
from kaggler.data_io import load_data

# load a libsvm format sparse feature file
X, y = load_data('train.sparse', dense=False)

# FTRL
clf = FTRL(a=.1,                # alpha in the per-coordinate rate
           b=1,                 # beta in the per-coordinate rate
           l1=1.,               # L1 regularization parameter
           l2=1.,               # L2 regularization parameter
           n=2**20,             # number of hashed features
           epoch=1,             # number of epochs
           interaction=True)    # use feature interaction or not

# training and prediction
clf.fit(X, y)
p = clf.predict(X)
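
To show how the a, b, l1 and l2 parameters enter the algorithm, here is a minimal sketch of the per-coordinate FTRL-proximal update for logistic regression on binary features.  It is not Kaggler’s implementation: the class and method names are hypothetical, and the hashing trick and feature interaction are left out to keep it short.

```python
import math


class TinyFTRL:
    """Illustrative FTRL-proximal learner for binary features."""

    def __init__(self, a=0.1, b=1.0, l1=1.0, l2=1.0):
        self.a, self.b, self.l1, self.l2 = a, b, l1, l2
        self.z = {}  # per-feature accumulated adjusted gradients
        self.n = {}  # per-feature accumulated squared gradients

    def _weight(self, i):
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0  # L1 keeps small weights at exactly zero
        sign = 1.0 if z > 0 else -1.0
        n = self.n.get(i, 0.0)
        # per-coordinate learning rate is a / (b + sqrt(n))
        return -(z - sign * self.l1) / ((self.b + math.sqrt(n)) / self.a + self.l2)

    def predict_one(self, x):
        """x is a list of active feature indices (each with value 1)."""
        wx = sum(self._weight(i) for i in x)
        wx = max(min(wx, 35.0), -35.0)  # avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-wx))

    def update(self, x, y):
        """Predict for one example, then update z and n with its label."""
        p = self.predict_one(x)
        g = p - y  # log loss gradient for every active weight
        for i in x:
            n = self.n.get(i, 0.0)
            sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / self.a
            self.z[i] = self.z.get(i, 0.0) + g - sigma * self._weight(i)
            self.n[i] = n + g * g
        return p
```

Each example is seen once per call to update, so memory use depends only on the number of distinct features, which is what makes this family of algorithms suitable for online learning.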

The latest code is available on GitHub.
Package documentation is available at https://pythonhosted.org/Kaggler/.

Please let me know if you have any comments or want to contribute. 🙂


Catching Up

Many things have happened since the last post in February.

1. Kaggle and other competitions

2. Kaggler package

  • Kaggler 0.3.8 was released.
  • Fellow Kaggler Jiming Ye added an online tree learner to the package.

I will post about each update soon.  Stay tuned! 🙂
