Kaggler 0.5.0 Released

I am glad to announce the release of Kaggler 0.5.0. This release significantly improves the performance of the FTRL algorithm, thanks to Po-Hsien Chu (github, kaggle, linkedin).


We increased the training speed by up to 100 times compared to 0.4.x. Our benchmark shows that one epoch over 1MM records with 8 features takes 1.2 seconds with 0.5.0, compared to 98 seconds with 0.4.x, on an i7 CPU.


The FTRL algorithm has been popular since it first appeared in a paper published by Google. It is suitable for highly sparse data, so it has been widely used for click-through-rate (CTR) prediction in online advertising. Many Kagglers use FTRL as one of their base algorithms in CTR prediction competitions. Therefore, we wanted to improve our FTRL implementation for the benefit of Kagglers who use our package.


We profiled the code with cProfile and resolved the overheads one by one:

  1. Remove the overhead of SciPy sparse matrix row operations: A SciPy sparse matrix checks many conditions in __getitem__, resulting in a lot of function calls. In fit(), we know that we’re fetching exactly one row at a time, and the index is very unlikely to be out of bounds, so we can fetch the indexes of each row in a faster way. This enhancement makes our FTRL 10x faster.
  2. More C-style enhancements: Specify types more clearly, return a whole list instead of yielding feature indexes, etc. These enhancements make our FTRL 5x faster when interaction==False.
  3. Faster hash function for interaction features: The last enhancement removes the overhead of hashing interaction features. We use MurmurHash3, which scikit-learn also uses, to directly hash the product of feature indexes. This enhancement makes our FTRL 5x faster when interaction==True.
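
The first and third enhancements can be sketched in plain Python. This is only an illustration under stated assumptions, not Kaggler's actual Cython code: the helper names csr_row_indices and interaction_indices are hypothetical, the matrix is assumed to be stored in CSR layout (indptr and indices arrays), and zlib.crc32 stands in for MurmurHash3 so that the snippet needs only the standard library.

```python
import zlib


def csr_row_indices(indptr, indices, row):
    """Return the column indexes of the non-zero entries of one CSR row.

    Slicing the raw arrays skips the generic checks a sparse matrix
    performs when it is indexed like X[row].
    """
    return indices[indptr[row]:indptr[row + 1]]


def interaction_indices(idx, n_buckets):
    """Hash the product of every pair of feature indexes into [0, n_buckets).

    zlib.crc32 is a stand-in hash here (assumption); the post says the
    package uses MurmurHash3.
    """
    hashed = []
    for a in range(len(idx)):
        for b in range(a + 1, len(idx)):
            key = str(idx[a] * idx[b]).encode()
            hashed.append(zlib.crc32(key) % n_buckets)
    return hashed


# A 3 x 3 sparse matrix in CSR form (hypothetical data):
# row 0 has columns 0 and 2, row 1 has column 2, row 2 has columns 0, 1, 2.
indptr = [0, 2, 3, 6]
indices = [0, 2, 2, 0, 1, 2]

row2 = csr_row_indices(indptr, indices, 2)
pairs = interaction_indices(row2, n_buckets=2 ** 20)
```

Three features in a row produce three pairwise interaction features, each mapped to one of the hashed buckets.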



Great Packages for Data Science in Python and R

This article is contributed by Hang Li at Hulu:

Domino’s Chief Data Scientist, Eduardo Ariño de la Rubia, talks about Python and R as the “best” language for data scientists.
Below is a list of useful packages from this talk, first for Python and then for R.


  • Feather – Fast, interoperable binary data frame storage for Python, R, and more powered by Apache Arrow
  • Ibis  – Productivity-centric Python data analysis framework for SQL systems and the Hadoop platform. Co-founded by the creator of pandas
  • Paratext  – A library for reading text files over multiple cores.
  • Bcolz  – A columnar data container that can be compressed.
  • Altair  – Declarative statistical visualization library for Python
  • Bokeh  – Interactive Web Plotting for Python
  • Blaze  – NumPy and Pandas interface to Big Data
  • xarray – N-D labeled arrays and datasets in Python
  • Dask  – Versatile parallel programming with task scheduling
  • Keras – High-level neural networks library, written in Python and capable of running on top of either TensorFlow or Theano.
  • PyMC3  – Probabilistic Programming in Python: Bayesian Modeling and Probabilistic Machine Learning with Theano


  • Feather – Fast, interoperable binary data frame storage for Python, R, and more powered by Apache Arrow
  • Haven – Import foreign statistical formats into R via the embedded ReadStat C library.
  • readr  – Read flat/tabular text files from disk (or a connection).
  • jsonlite – A fast JSON parser and generator optimized for statistical data and the web.
  • ggplot2 – A system for ‘declaratively’ creating graphics, based on “The Grammar of Graphics”.
  • htmlwidgets – A framework for creating HTML widgets that render in various contexts including the R console, ‘R Markdown’ documents, and ‘Shiny’ web applications.
  • leaflet – Create and customize interactive maps using the ‘Leaflet’ JavaScript library and the ‘htmlwidgets’ package.
  • tilegramsR  – Provide R spatial objects representing Tilegrams.
  • dplyr – A fast, consistent tool for working with data frame like objects, both in memory and out of memory.
  • broom – Convert statistical analysis objects from R into tidy data frames
  • tidytext – Text mining for word processing and sentiment analysis using ‘dplyr’, ‘ggplot2’, and other tidy tools.
  • mxnet – The MXNet R package brings flexible and efficient GPU computing and state-of-the-art deep learning to R.
  • tensorflow – TensorFlow™ is an open source software library for numerical computation using data flow graphs.

[Video] A Huge Debate: R vs. Python for Data Science

Solution Sharing for the Allstate Competition at Kaggle

I participated in the Allstate competition at Kaggle and finished 54th out of 3,055 teams.  I shared my solution in the forum after the competition here:

Congrats to the winners and top performers, and thanks to all contributors in the forum for sharing so generously. It’s always a humbling experience to compete at Kaggle. I learn so much from fellow Kagglers at every competition.

Here I’d like to share my code base and notes for the competition:

My friends and I have been using a framework based on Makefiles for competitions for years now, and it has worked great so far.

Introduction to the framework is available on the TalkingData forum:

Our code repos for past competitions are also available at:

Hope it’s helpful.

Solution Sharing for the Bosch Competition at Kaggle

At the Bosch competition at Kaggle, I teamed up with Hang, Mert, Erkut, and Wendy.  We finished 22nd out of 1,373 teams.

Our code is available here: https://gitlab.com/mbay/bosch

and internal LB is available here: https://gitlab.com/mbay/bosch/wikis/home


Solution Sharing for the Talking Data Competition at Kaggle


At the Talking Data Competition at Kaggle, I teamed up with Luca, Hang, Mert, Erkut, and Damien.  We finished 37th out of 1,689 teams.  I originally posted this to the forum here:

Hi, I’d like to share my team, ensemble’s solution and framework.

The code is available at gitlab:

and team’s internal LB is available here:

We joined the competition late, and had just enough time to build and run the end-to-end framework without much feature engineering. So feature-wise, there is nothing fancy, but I hope that you can find the framework itself helpful. 🙂

As you can see, it uses Makefiles to pipeline feature generation, single model training, and ensemble training. The main benefits of our framework based on Makefiles are:

  • It’s language agnostic – You can use any language for any part of the pipeline. Although this specific version uses Python throughout the pipeline, I used to mix R, Python, and other executables to run the pipeline.
  • It checks dependencies automatically – It checks if previous steps were completed, and if not, it runs those steps automatically.
  • It’s modular – When working with others, it’s easy to split tasks across team members so that each one can focus on a different part of the pipeline.
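
The dependency check can be pictured with a small Python sketch. It is a simplification of what make does, written under the assumption that a step is a single target file with a list of prerequisite files; the function name needs_rebuild and the file names are hypothetical.

```python
import os


def needs_rebuild(target, prerequisites):
    """Return True if `target` is missing or older than any prerequisite.

    This mirrors the timestamp rule make applies to each target before
    deciding whether to run its recipe.
    """
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)
```

For example, needs_rebuild("build/feature/feature3.trn.sps", ["data/train.csv"]) would report whether the feature file has to be regenerated after the raw data changed. make applies the same rule recursively, so a stale feature file also triggers the model and ensemble steps that depend on it.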

If you are new to Makefiles, here are some references:

Enjoy. 🙂

Predictive Modeling: Why the “who” is just as important as the “how”

Originally posted at conversionlogic.com

There is significant debate in the data science community around the most important ingredients for attaining accurate results from predictive models. Some claim that it’s all about the quality and/or quantity of data, that you need a certain size data set (typically large) of a particular quality (typically very good) in order to get meaningful outputs. Others focus more on the models themselves, debating the merits of different single models – deep learning, gradient boosting machine, Gaussian process, etc. – versus a combined approach like the Ensemble Method.

I think that both of these positions have some truth. While it’s not as simple as “more data, better results” (see cases from Twitter and Netflix showing that the volume of data was almost meaningless to predictive accuracy), nor is the model itself the only predictor of success, all of those elements do play a role in how precise the results will be. But there is another factor that is almost always overlooked: the modelers themselves.

Like the models they create, not all data scientists are created equal. I am less interested in who is “smarter” or has a better education, and more in how competitive and dedicated the modeler is. Most marketers don’t question the qualifications of a data science team because they expect that given good data and a solid algorithmic approach, they will achieve good predictive performance. At the very least, performance across different modelers should be comparable. Unfortunately, that’s not always the case.

In his New York Times bestseller Superforecasting, Prof. Philip Tetlock at the University of Pennsylvania showed that, at the Intelligence Advanced Research Projects Activity (IARPA) tournament, the performance of “superforecasters” was 50% better than standard, and 30% better than those with access to secret data. This clearly demonstrates that the people doing the modeling, not the data or the models themselves, make a huge difference.

More relevant to predictive modeling specifically, KDD, one of the most prestigious data science conferences, has hosted an annual predictive modeling competition, KDD Cup, since 1997. It attracts participants from top universities, companies, and industries around the world. Although every team is given exactly the same data set, and is familiar with the same state-of-the-art algorithms, the resulting performances vary wildly across teams. Last year, the winning team achieved 91% accuracy while over 100 teams remained below 63% accuracy, 30% lower than the best score.

Both of these examples show the importance of not just the “how,” but the “who” when it comes to predictive modeling. This isn’t always the easiest thing for marketers to assess, but should definitely be taken into consideration when evaluating predictive analytics solutions. Ask about the data, and the models and methodology, but don’t forget the modelers themselves. The right data scientists can make all the difference to the success of your predictive program.

Kaggler’s Toolbox – Setup

I’d like to open up the toolbox that I’ve built for data mining competitions, and share it with you.

Let me start with my setup.


I have access to 2 machines:

  • Laptop – Macbook Pro Retina 15″, OS X Yosemite, i7 2.3GHz 4 Core CPU, 16GB RAM, GeForce GT 750M 2GB, 500GB SSD
  • Desktop – Ubuntu 14.04, i7 5820K 3.3GHz 6 Core CPU, 64GB RAM, GeForce GT 620 1GB, 120GB SSD + 3TB HDD

I purchased the desktop on eBay for around $2,000 a year ago (September 2014).


As the code repository and version control system, I use git.

It’s useful for collaboration with other team members.  It makes it easy to share the code base, keep track of changes, and resolve conflicts when two people change the same code.

It’s useful even when I work by myself.  It helps me reuse and improve the code from previous competitions.

For competitions, I use gitlab instead of github because it offers an unlimited number of private repositories.

S3 / Dropbox

I use S3 to share files between my machines.  It is cheap – it costs me about $0.10 per month on average.

To access S3, I use AWS CLI.  I also used s3cmd in the past and liked it.

I use Dropbox to share files between team members.


For flow control or pipelining, I use makefiles (or GNU make).

It modularizes the long process of a data mining competition into feature extraction, single model training, and ensemble model training, and controls workflow between components.

For example, I have a top level makefile that defines the raw data file locations, folder hierarchies, and target variable.

# directories
DIR_DATA := data
DIR_BUILD := build
DATA_TRN := $(DIR_DATA)/train.csv
DATA_TST := $(DIR_DATA)/test.csv
Y_TRN := $(DIR_DATA)/y.trn.yht

# build the target file from the training data
$(Y_TRN): $(DATA_TRN)
	cut -d, -f2 $< | tail -n +2 > $@

Then, I have makefiles for features that include the top level makefile, and define how to generate training and test feature files in various formats (CSV, libSVM, VW, libFFM, etc.).

include Makefile

FEATURE_NAME := feature3



	src/generate_feature3.py --train-file $< \
                                 --test-file $(lastword $^) \
                                 --train-feature-file $(FEATURE_TRN) \
                                 --test-feature-file $(FEATURE_TST)
%.ffm: %.sps
	src/svm_to_ffm.py --svm-file $< \
                          --ffm-file $@ \
                          --feature-name $(FEATURE_NAME)

Then, I have makefiles for single model training that include a feature makefile, and define how to train a single model and produce CV and test predictions.

include Makefile.feature.feature3

N = 400
LRATE = 0.05
ALGO_NAME := xg_$(N)_$(DEPTH)_$(LRATE)

all: validation submission
validation: $(METRIC_VAL)
submission: $(SUBMISSION_TST)
retrain: clean_$(ALGO_NAME) submission

                                   | $(DIR_VAL) $(DIR_TST)
	./src/train_predict_xg.py --train-file $< \
                                  --test-file $(word 2, $^) \
                                  --predict-valid-file $(PREDICT_VAL) \
                                  --predict-test-file $(PREDICT_TST) \
                                  --depth $(DEPTH) \
                                  --lrate $(LRATE) \
                                  --n-est $(N)

	paste -d, $(lastword $^) $< > $@

Then, I have makefiles for ensemble features that define which single model predictions are included in ensemble training.

include Makefile


BASE_MODELS := xg_600_4_0.05_feature9 \
               xg_400_4_0.05_feature6 \
               ffm_30_20_0.01_feature3 \

PREDICTS_TRN := $(foreach m, $(BASE_MODELS), $(DIR_VAL)/$(m).val.yht)
PREDICTS_TST := $(foreach m, $(BASE_MODELS), $(DIR_TST)/$(m).tst.yht)


	paste -d, $^ > $@

	paste -d, $^ > $@
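
For readers less familiar with paste, the two recipes above simply join the prediction files column-wise, one column per base model. A rough Python equivalent is sketched below; the function name paste_columns is hypothetical, and it assumes every prediction file has the same number of lines (zip stops at the shortest file, whereas paste would pad missing fields).

```python
def paste_columns(in_paths, out_path, delimiter=","):
    """Write line i of every input file, joined by `delimiter`, as line i of the output."""
    handles = [open(path) for path in in_paths]
    try:
        with open(out_path, "w") as out:
            # zip() walks all files in lockstep, one line from each per step
            for rows in zip(*handles):
                out.write(delimiter.join(r.rstrip("\n") for r in rows) + "\n")
    finally:
        for handle in handles:
            handle.close()
```

Each row of the resulting file holds the predictions of all base models for one sample, which is what the ensemble model is trained on.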

Finally, I can (re)produce the submission from XGBoost ensemble with 9 single models described in Makefile.feature.esb9 by (1) replacing include Makefile.feature.feature3 in Makefile.xg with include Makefile.feature.esb9 and (2) running:

$ make -f Makefile.xg

SSH Tunneling

When I’m connected to the Internet, I always ssh to the desktop for its computational resources (mainly for RAM).

I followed Julian Simioni’s tutorial to allow remote SSH connections to the desktop.  It needs an additional system with a publicly accessible IP address.  You can set up an AWS micro (or free tier) EC2 instance for it.


tmux allows you to keep your SSH sessions even when you get disconnected.  It also lets you split/add terminal screens in various ways and switch easily between them.

The documentation might look overwhelming, but all you need is:
# If there is no tmux session:
$ tmux


# If you created a tmux session, and want to connect to it:
$ tmux attach

Then, to create a new pane/window and navigate between them, press Ctrl + b, release it, and press one of the following keys:

  • Ctrl + b, then " – to split the current pane into top and bottom panes.
  • Ctrl + b, then % – to split the current pane into left and right panes.
  • Ctrl + b, then o – to move to the next pane in the current window.
  • Ctrl + b, then c – to create a new window.
  • Ctrl + b, then n – to move to the next window.

To close a pane/window, just type exit in the pane/window.


Hope this helps.

Next up: the machine learning tools I use.

Please share your setups and thoughts too. 🙂

Kaggler 0.4.0 Released

UPDATE on 9/15/2015

I found a bug in OneHotEncoder, and fixed it.  The fix is not available on pip yet, but you can update Kaggler to the latest version from the source as follows:

$ git clone https://github.com/jeongyoonlee/Kaggler.git
$ cd Kaggler
$ python setup.py build_ext --inplace
$ sudo python setup.py install

If you find a bug, please submit a pull request to github or comment here.

I’m glad to announce the release of Kaggler 0.4.0.

Kaggler is a Python package that provides utility functions and online learning algorithms for classification.  I use it for Kaggle competitions along with scikit-learn, Lasagne, XGBoost, and Vowpal Wabbit.

Kaggler 0.4.0 added the scikit-learn like interface for preprocessing, metrics, and online learning algorithms.


Classes in kaggler.preprocessing now support fit, fit_transform, and transform methods. Currently 2 preprocessing classes are available as follows:

  • Normalizer – aligns distributions of numerical features into a normal distribution. Note that it’s different from sklearn.preprocessing.Normalizer, which only scales features without changing distributions.
  • OneHotEncoder – transforms categorical features into dummy variables.  It is similar to sklearn.preprocessing.OneHotEncoder except that it groups infrequent values into a single dummy variable.

from kaggler.preprocessing import OneHotEncoder

# values appearing fewer than min_obs times are grouped into one dummy variable.
enc = OneHotEncoder(min_obs=10, nan_as_var=False)
X_train = enc.fit_transform(train)
X_test = enc.transform(test)
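
To make the grouping of infrequent values concrete, here is a minimal standard-library sketch of the idea. It is not Kaggler's implementation: the function names fit_category_index and transform_category are hypothetical, and reserving index 0 for rare or unseen values is an assumption made for illustration.

```python
from collections import Counter


def fit_category_index(values, min_obs=10):
    """Assign a column index to every value seen at least min_obs times.

    Index 0 is reserved (assumption) for the shared "infrequent" column.
    """
    counts = Counter(values)
    frequent = sorted(v for v, c in counts.items() if c >= min_obs)
    return {v: i + 1 for i, v in enumerate(frequent)}


def transform_category(values, index):
    """Map each value to its column index; rare or unseen values map to 0."""
    return [index.get(v, 0) for v in values]


# Hypothetical data: "a" and "b" are frequent, "c" and "d" are not.
train = ["a"] * 3 + ["b"] * 3 + ["c", "d"]
index = fit_category_index(train, min_obs=2)
encoded = transform_category(["a", "c", "d", "z", "b"], index)
```

Grouping rare values this way keeps the number of dummy variables small when a categorical feature has a long tail of values that appear only a handful of times.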


3 metrics are available as follows:

  • logloss – calculates the bounded log loss error for classification predictions.
  • rmse – calculates the root mean squared error for regression predictions.
  • gini – calculates the gini coefficient for regression predictions.

from kaggler.metrics import gini

score = gini(y, p)
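
As a reference for what the first two metrics compute, here is a plain-Python sketch. It is not Kaggler's code: the clipping bound eps=1e-15 is an assumption, chosen only to show what "bounded" means (predictions of exactly 0 or 1 would otherwise give an infinite loss).

```python
import math


def rmse(y, p):
    """Root mean squared error between targets y and predictions p."""
    return math.sqrt(sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y))


def logloss(y, p, eps=1e-15):
    """Log loss for binary targets, with predictions clipped to [eps, 1 - eps]."""
    total = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1 - eps)  # bound the prediction away from 0 and 1
        total += yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
    return -total / len(y)
```

A prediction of 0.5 for a positive example gives a log loss of log(2), and a confidently wrong prediction gives a large but finite loss thanks to the clipping.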


Classes in kaggler.online_model (except ClassificationTree) now support the fit and predict methods. Currently 5 online learning algorithms are available as follows:

  • SGD – stochastic gradient descent algorithm with hashing trick and interaction
  • FTRL – follow-the-regularized-leader algorithm with hashing trick and interaction
  • FM – factorization machine algorithm
  • NN (or NN_H2) – neural network algorithm with a single (or double) hidden layer(s)
  • ClassificationTree – decision tree algorithm

from kaggler.online_model import FTRL
from kaggler.data_io import load_data

# load a libsvm format sparse feature file
X, y = load_data('train.sparse', dense=False)

clf = FTRL(a=.1,                # alpha in the per-coordinate rate
           b=1,                 # beta in the per-coordinate rate
           l1=1.,               # L1 regularization parameter
           l2=1.,               # L2 regularization parameter
           n=2**20,             # number of hashed features
           epoch=1,             # number of epochs
           interaction=True)    # use feature interaction or not

# training and prediction
clf.fit(X, y)
p = clf.predict(X)
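
To show what the a, b, l1, and l2 parameters control, here is a minimal pure-Python sketch of the per-coordinate FTRL-Proximal update for logistic regression with binary features. It is an illustration only, not Kaggler's Cython implementation: the class name TinyFTRL and its methods are hypothetical, and it leaves out the hashing trick and feature interactions.

```python
import math


class TinyFTRL:
    """Per-coordinate FTRL-Proximal for binary features (illustrative sketch)."""

    def __init__(self, a=0.1, b=1.0, l1=1.0, l2=1.0):
        self.a, self.b, self.l1, self.l2 = a, b, l1, l2
        self.z = {}  # per-feature adjusted gradient sums
        self.n = {}  # per-feature sums of squared gradients

    def _weight(self, i):
        """Compute the weight of feature i lazily from z and n."""
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0  # L1 regularization keeps this weight at exactly zero
        sign = 1.0 if z > 0 else -1.0
        rate = (self.b + math.sqrt(self.n.get(i, 0.0))) / self.a
        return -(z - sign * self.l1) / (rate + self.l2)

    def predict_one(self, idx):
        """Predict P(y=1) for a sample given as a list of active feature indexes."""
        wx = sum(self._weight(i) for i in idx)
        wx = max(min(wx, 35.0), -35.0)  # avoid overflow in exp()
        return 1.0 / (1.0 + math.exp(-wx))

    def update_one(self, idx, y):
        """Update the model with one sample; indexes in idx are assumed unique."""
        p = self.predict_one(idx)
        g = p - y  # gradient of the log loss w.r.t. each active weight
        for i in idx:
            w = self._weight(i)
            n = self.n.get(i, 0.0)
            sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / self.a
            self.z[i] = self.z.get(i, 0.0) + g - sigma * w
            self.n[i] = n + g * g
        return p


# Hypothetical toy data: feature 0 always comes with y=1, feature 1 with y=0.
model = TinyFTRL(a=0.1, b=1.0, l1=1.0, l2=1.0)
for _ in range(10):
    model.update_one([0], 1)
    model.update_one([1], 0)
```

Here a and b set the per-coordinate learning rate, which shrinks as a feature accumulates gradient, while l1 holds the weights of rarely informative features at exactly zero and l2 shrinks the remaining ones.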

Latest code is available at github.
Package documentation is available at https://pythonhosted.org/Kaggler/.

Please let me know if you have any comments or want to contribute. 🙂

Catching Up

Many things have happened since the last post in February.

1. Kaggle and other competitions

2. Kaggler package

  • Kaggler 0.3.8 was released.
  • Fellow Kaggler Jiming Ye added an online tree learner to the package.

I will post about each update soon.  Stay tuned! 🙂